The Cross-Entropy 
Method 

A Unified Approach to Combinatorial 
Optimization, Monte-Carlo Simulation, 
and Machine Learning 



Reuven Y. Rubinstein 
Dirk P. Kroese 



Information 
Science 
& Statistics 




Information Science and Statistics 



Series Editors: 
M. Jordan 
J. Kleinberg 
B. Schölkopf
F.P. Kelly 
I. Witten 







With 38 Illustrations 



Springer




Reuven Y. Rubinstein 
Department of Industrial Engineering 
and Management 
Technion 
Technion City 
Haifa 32 000 
Israel 

ierrr01@ie.technion.ac.il



Dirk P. Kroese 
Department of Mathematics 
University of Queensland 
Brisbane 
Queensland 4072 
Australia 

kroese@maths.uq.edu.au 



Series Editors: 

Michael Jordan 
Division of Computer Science 
and Department of Statistics 
University of California, Berkeley
Berkeley, CA 94720
USA 

Frank P. Kelly 
Statistical Laboratory 
Centre for Mathematical Sciences 
Wilberforce Road 
Cambridge CB3 0WB
UK 

Library of Congress Cataloging-in-Publication Data 
Rubinstein, Reuven Y. 

The cross-entropy method : a unified approach to combinatorial optimization, 

Monte-Carlo simulation, and machine learning / Reuven Y. Rubinstein, Dirk P. Kroese. 
p. cm. — (Information in sciences and statistics) 

Includes bibliographical references and index. 

ISBN 978-1-4419-1940-3 ISBN 978-1-4757-4321-0 (eBook) 

DOI 10.1007/978-1-4757-4321-0 

1. Cross-entropy method. 2. Combinatorial optimization. 3. Monte-Carlo simulation. 4. 
Machine learning. I. Kroese, Dirk P. II. Title. III. Series.

QA402.5.R815 2004 

519.6'4— dc22 2004048202 

ISBN 978-1-4419-1940-3 Printed on acid-free paper.

© 2004 Springer Science+Business Media New York

Originally published by Springer Science+Business Media, Inc. in 2004

Softcover reprint of the hardcover 1st edition 2004

All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer Science+Business Media, LLC),
except for brief excerpts in connection with reviews or scholarly analysis.
Use in connection with any form of information storage and retrieval,
electronic adaptation, computer software, or by similar or dissimilar methodology now known or
hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks and similar terms, even if 
they are not identified as such, is not to be taken as an expression of opinion as to whether or not 
they are subject to proprietary rights. 



Jon Kleinberg 

Department of Computer Science 
Cornell University 
Ithaca, NY 14853 
USA 



Bernhard Schölkopf
Max Planck Institute for
Biological Cybernetics
Spemannstrasse 38
72076 Tübingen
Germany 



Ian Witten 

Department of Computer Science 
University of Waikato 
Hamilton 
New Zealand 



987654321 SPIN 10966174 



springeronline.com 




To the memory of my teacher and friend Leonard Rastrigin

Reuven Rubinstein 



For Lesley, Elise, and Jessica 
Dirk Kroese 




Preface 



This book is a comprehensive and accessible introduction to the cross-entropy 
(CE) method. The CE method started life around 1997 when the first author 
proposed an adaptive algorithm for rare-event simulation using a cross-entropy 
minimization technique. It was soon realized that the underlying ideas had a 
much wider range of application than just in rare-event simulation; they could 
be readily adapted to tackle quite general combinatorial and multi-extremal 
optimization problems, including many problems associated with the field of 
learning algorithms and neural computation. 

The book is based on an advanced undergraduate course on the CE
method, given at the Israel Institute of Technology (Technion) for the last
three years. It is aimed at a broad audience of engineers, computer scientists,
mathematicians, statisticians and in general anyone, theorist or practitioner,
who is interested in smart simulation, fast optimization, learning algorithms,
image processing, etc. Our aim was to write a book on the CE method which
was accessible to advanced undergraduate students and engineers who simply
want to apply the CE method in their work, while at the same time
accentuating the unifying and novel mathematical ideas behind the CE method, so
as to stimulate further research at a postgraduate level.

The emphasis in this book is placed on concepts rather than on mathematical
completeness. We assume that the reader has some basic mathematical
background, such as a basic undergraduate course in probability and
statistics. We have deliberately tried to avoid the formal “definition - lemma -
theorem - proof” style of many mathematics books. Instead we embed most
definitions in the text and introduce and explain various concepts via
examples and experiments. In short, our goal is to promote a new unified way of
thinking on the connection between rare events in simulation and optimization
of complex systems in general, rather than burden the reader with too
much technical detail.

Most of the combinatorial and continuous multi-extremal optimization
case studies in this book are benchmark problems taken from the World Wide
Web, and CE was compared with the best known solutions. In all examples
tested so far the relative error of CE was within the limits of 1-2% of the
best known solution. For some instances CE produced even more accurate
solutions. It is crucial to emphasize that for the “noisy” counterparts of these
test problems, which are obtained by adding random noise to the objective
function, CE still performs quite accurately, provided the sample size is
increased accordingly. Since our extensive numerical experience with different
case studies suggests that CE is quite reliable (by comparing it with the best
known solution), we purposely avoided comparing it with other heuristics such
as simulated annealing and genetic algorithms. This of course does not imply
that one will not find a problem where CE performs poorly and therefore
will be less accurate than some other methods. However, when such problems
do occur, our FACE (fully adaptive CE) algorithm (see Chapter 5) should
identify it reliably.

Chapter 1 starts with some background on the cross-entropy method. It 
provides a summary of mathematical definitions and concepts relevant for 
this book, including a short review of various terms and ideas in probability, 
statistics, information theory, and modern simulation. A good self-contained 
entry point to this book is the tutorial Chapter 2, which provides a gradual 
introduction to the CE method, and shows immediately the elegance and 
versatility of the CE algorithm. In Chapter 3 we discuss the state of the 
art in efficient simulation and adaptive importance sampling using the CE 
concept. Chapter 4 deals with CE optimization techniques, with particular 
emphasis on combinatorial optimization problems. In Chapter 5 we apply 
CE to continuous optimization problems, and give various modifications and 
enhancements of the basic CE algorithm. The contemporary subject of noisy 
(stochastic) optimization is discussed in Chapter 6. Due to its versatility, 
tractability, and simplicity, the CE method has great potential for a diverse 
range of new applications, for example in the fields of computational biology, 
graph theory, and scheduling. Various applications, including DNA sequence 
alignment, are given in Chapter 7. A connection between the CE method and 
machine learning — specifically with regard to optimization — is presented 
in Chapter 8. A wide range of exercises is provided at the end of each chapter. 
Difficult exercises are marked with a * sign. Finally, example CE programs, 
written in Matlab, are given in the appendix. 

Acknowledgement. We thank all of those who contributed to this work. Sergey
Porotsky provided a thorough review of an earlier version of this book and
supplied an effective modification to the basic CE algorithm for continuous
optimization. Arkadii Nemirovskii and Leonid Mytnik made a valuable contribution to the
convergence proofs of the CE method. Søren Asmussen, Pieter-Tjerk de Boer,
Peter Glynn, Tito Homem-de-Melo, Jonathan Keith, Shie Mannor, Phil Pollett, Ad
Ridder, and Perwez Shahabuddin read and commented upon sections of this book.

We are especially grateful to the many undergraduate and graduate students at
the Technion and the University of Queensland, who helped make this book
possible, and whose valuable ideas and experiments were extremely encouraging and
motivating. Yohai Gat, Uri Dubin, Leonid Margolin, Levon Kikinian, and Alexander
Podgaetsky ran CE on a huge variety of combinatorial and neural-based problems.
Rostislav Man ran extensive experiments on heavy-tailed distributions and supplied
useful convergence results. He also provided the Matlab programs in Appendix A.6.
Ron Zohar provided the second generation algorithm for the buffer allocation
problem. Joshua Ross helped establish the material in Appendix 3.13. Sho Nariai pointed
out various errors in an earlier version. Zdravko Botev conscientiously went through
and checked the exercises. Thomas Taimre provided various Matlab programs in the
appendix and meticulously carried out the clustering experiments. This book was
supported by the Israel Science Foundation (grant no. 191-565).



Haifa and Brisbane, 

January 2004 

Reuven Y. Rubinstein 
Technion 

Dirk P. Kroese 
The University of Queensland 




Acronyms 



ACO       Ant Colony Optimization
ARM       Acceptance Rejection Method
ASP       Associated Stochastic Problem
BAP       Buffer Allocation Problem
cdf       cumulative distribution function
CE        Cross-Entropy
CLT       Central Limit Theorem
CMC       Crude Monte Carlo
CoM       Change of Measure
COP       Combinatorial Optimization Problem
CVRP      Capacitated Vehicle Routing Problem
DES       Discrete Event System
DP        Dynamic Programming
ECM       Exponential Change of Measure
FACE      Fully Adaptive Cross-Entropy
FKM       Fuzzy K-Means
i.i.d.    independent identically distributed
ITLR      Inverse Transform Likelihood Ratio
IS        Importance Sampling
LPP       Longest Path Problem
LR        Likelihood Ratio
LVQ       Linear Vector Quantization
max-cut   Maximal Cut
MC        Maximum Clique
MCMC      Markov Chain Monte Carlo
MCWR      Markov Chain With Replacement
MDP       Markov Decision Process
MLE       Maximum Likelihood Estimate (or Estimator)
MME       Method of Moments Estimate (or Estimator)
NEF       Natural Exponential Family
NOP       Noisy Optimization Problem
ODTM      Optimal Degenerate Transition Matrix
pdf       probability density function
pmf       probability mass function
PFSP      Permutation Flow Shop Problem
QAP       Quadratic Assignment Problem
RE        Relative Error
RL        Reinforcement Learning
SA        Stochastic Approximation
SF        Score Function
SCV       Squared Coefficient of Variation
SEN       Stochastic Edge Network
SMCDDP    Single Machine Common Due Date Problem
SMTWTP    Single Machine Total Weighted Tardiness Problem
SNN       Stochastic Node Network
SPP       Shortest Path Problem
TL        Table Look
TSP       Traveling Salesman Problem
VM        Variance Minimization
VRP       Vehicle Routing Problem
WIP       Work In Process


List of Symbols 



$\gg$            much greater than
$\sim$           is distributed according to
$\approx$        approximately
$\nabla$         $\nabla f$ is the gradient of $f$
$\nabla^2$       $\nabla^2 f$ is the Hessian of $f$
$\mathbb{E}$     expectation
$\mathbb{N}$     set of natural numbers $\{0, 1, \ldots\}$
$\mathbb{P}$     probability measure
$\mathbb{R}$     the real line = 1-dimensional Euclidean space
$\mathbb{R}^n$   n-dimensional Euclidean space
$\mathcal{D}$    Kullback-Leibler cross-entropy
$\mathcal{H}$    Shannon entropy
$\mathcal{I}$    Fisher information
$\mathcal{L}$    likelihood function
$\mathcal{M}$    mutual information
$\mathcal{S}$    score function

Ber              Bernoulli distribution
Beta             beta distribution
Bin              binomial distribution
DExp             double exponential distribution
DU               discrete uniform distribution
Exp              exponential distribution
Gam              gamma distribution
G                geometric distribution
N                normal or Gaussian distribution
Pareto           Pareto distribution
Po               Poisson distribution
SExp             shifted exponential distribution
U                uniform distribution
Weib             Weibull distribution

$\alpha$         smoothing parameter
$\gamma$         level parameter
$\varepsilon$    relative experimental error
$\psi$           parameterization; $\theta = \psi(u)$
$\zeta$          cumulant function (log of moment generating function)
$\varrho$        rarity parameter
$D(v)$           objective function for CE minimization
$f$              probability density function
$I_A$            indicator function of event $A$
$\ln$            (natural) logarithm
$N$              sample size
$o$              Little-o order symbol
$O$              Big-O order symbol
$S$              performance function
$S_{(i)}$        $i$-th order statistic
$u$              nominal reference parameter (vector)
$v$              reference parameter (vector)
$\hat{v}$        estimated reference parameter
$v^*$            CE optimal reference parameter
${}_*v$          VM optimal reference parameter
$V(v)$           objective function for VM minimization
$W$              likelihood ratio

$\mathbf{x}, \mathbf{y}$      vectors
$\mathbf{X}, \mathbf{Y}$      random vectors/matrices
$\mathcal{X}, \mathcal{Y}$    sets




Contents 



Acronyms xi 

List of Symbols xiii 

1 Preliminaries 1 

1.1 Motivation for the Cross-Entropy Method 1 

1.2 Random Experiments and Probability Distributions 2 

1.3 Exponential Families 6 

1.4 Efficiency of Estimators 8 

1.5 Information 10 

1.6 The Score Function Method * 16 

1.7 Generating Random Variables 19 

1.7.1 The Inverse-Transform Method 19 

1.7.2 Alias Method 21 

1.7.3 The Composition Method 21 

1.7.4 The Acceptance-Rejection Method 22 

1.8 Exercises 24 

2 A Tutorial Introduction to the Cross-Entropy Method 29 

2.1 Introduction 29 

2.2 Methodology: Two Examples 31 

2.2.1 A Rare-Event Simulation Example 31 

2.2.2 A Combinatorial Optimization Example 34 

2.3 The CE Method for Rare-Event Simulation 36 

2.4 The CE-Method for Combinatorial Optimization 41 

2.5 Two Applications 45 

2.5.1 The Max-Cut Problem 46 

2.5.2 The Travelling Salesman Problem 51 

2.6 Exercises 57 




3 Efficient Simulation via Cross-Entropy 59

3.1 Introduction 59 

3.2 Importance Sampling 62 

3.3 Kullback-Leibler Cross-Entropy 67 

3.4 Estimation of Rare-Event Probabilities 72 

3.5 Examples 75 

3.6 Convergence Issues 83 

3.7 The Root-Finding Problem 90 

3.8 The TLR Method 91 

3.9 Equivalence Between SLR and TLR 96 

3.10 Stationary Waiting Time of the GI/G/1 Queue 100 

3.11 Big-Step CE Algorithm 104 

3.12 Numerical Results 105 

3.12.1 Sum of Weibull Random Variables 106 

3.12.2 Sum of Pareto Random Variables 108 

3.12.3 Stochastic Shortest Path 109 

3.12.4 GI/G/1 Queue 111

3.12.5 Two Non-Markovian Queues with Feedback 116 

3.13 Appendix: The Sum of Two Weibulls 118 

3.14 Exercises 121 

4 Combinatorial Optimization via Cross-Entropy 129

4.1 Introduction 129 

4.2 The Main CE Algorithm for Optimization 132 

4.3 Adaptive Parameter Updating in Stochastic Node Networks 136

4.3.1 Conditional Sampling 137 

4.3.2 Degenerate Distribution 138 

4.4 Adaptive Parameter Updating in Stochastic Edge Networks 138

4.4.1 Parameter Updating for Markov Chains 139 

4.4.2 Conditional Sampling 140 

4.4.3 Optimal Degenerate Transition Matrix 140 

4.5 The Max-Cut Problem 140 

4.6 The Partition Problem 145 

4.7 The Travelling Salesman Problem 147 

4.8 The Quadratic Assignment Problem 154 

4.9 Numerical Results for SNNs 156 

4.9.1 Synthetic Max-Cut Problem 157 

4.9.2 r-Partition 160 

4.9.3 Multiple Solutions 162 

4.9.4 Empirical Computational Complexity 168 

4.10 Numerical Results for SENs 168 

4.10.1 Synthetic TSP 169 

4.10.2 Multiple Solutions 170 

4.10.3 Experiments with Sparse Graphs 171 

4.10.4 Numerical Results for the QAP 173 







4.11 Appendices 174 

4.11.1 Two Tour Generation Algorithm for the TSP 174 

4.11.2 Speeding up Trajectory Generation 177 

4.11.3 An Analysis of Algorithm 4.5.2 for a Partition Problem 179 

4.11.4 Convergence of CE Using the Best Sample 181 

4.12 Exercises 185 

5 Continuous Optimization and Modifications 187 

5.1 Continuous Multi-Extremal Optimization 187 

5.2 Alternative Reward Functions 190 

5.3 Fully Adaptive CE Algorithm 191 

5.4 Numerical Results for Continuous Optimization 194 

5.5 Numerical Results for FACE 196 

5.6 Exercises 200 

6 Noisy Optimization with CE 203 

6.1 Introduction 203 

6.2 The CE Algorithm for Noisy Optimization 204 

6.3 Optimal Buffer Allocation in a Simulation-Based 

Environment [9] 207 

6.4 Numerical Results for the BAP 213

6.5 Numerical Results for the Noisy TSP 216 

6.6 Exercises 225 

7 Applications of CE to COPs 227 

7.1 Sequence Alignment 227 

7.2 Single Machine Total Weighted Tardiness Problem 234 

7.3 Single Machine Common Due Date Problem 237 

7.4 Capacitated Vehicle Routing Problem 238 

7.5 The Clique Problem 240 

7.6 Exercises 247 

8 Applications of CE to Machine Learning 251 

8.1 Mastermind Game 251 

8.1.1 Numerical Results 253 

8.2 The Markov Decision Process and Reinforcement Learning 254

8.2.1 Policy Learning via the CE Method 256 

8.2.2 Numerical Results 259 

8.3 Clustering and Vector Quantization 260 

8.3.1 Numerical Results 263 

8.4 Exercises 268 







A Example Programs 271 

A.1 Rare Event Simulation 272

A.2 The Max-Cut Problem 274

A.3 Continuous Optimization via the Normal Distribution 276

A.4 FACE 277

A.5 Rosenbrock 281

A.6 Beta Updating 283 

A.7 Banana Data 285 

References 287 

Index 297 




1 Preliminaries



The purpose of this preparatory chapter is to provide some initial background 
for this book, including a summary of the relevant definitions and concepts 
in probability, statistics, simulation and information theory. Readers familiar 
with these ideas may wish to skip through this chapter and proceed to the 
tutorial on the cross-entropy method in Chapter 2. 



1.1 Motivation for the Cross-Entropy Method 

Many everyday tasks in operations research involve solving complicated
optimization problems. The travelling salesman problem (TSP), the quadratic
assignment problem (QAP) and the max-cut problem are a representative
sample of combinatorial optimization problems (COP) where the problem
being studied is completely known and static. In contrast, the buffer
allocation problem (BAP) is a noisy estimation problem where the objective
function needs to be estimated since it is unknown. Discrete event simulation is
one method for estimating an unknown objective function. An increasingly
popular approach is to tackle these problems via stochastic (or randomized)
algorithms, specifically via the simulated cross-entropy or simply the
cross-entropy (CE) method pioneered in [145]. The CE method derives its name
from the cross-entropy (or Kullback-Leibler) distance, which is a fundamental
concept of modern information theory [39, 95]. The method was motivated
by an adaptive algorithm for estimating probabilities of rare events in
complex stochastic networks [144], which involves variance minimization. It was
soon realized [145, 146] that a simple modification of [144], involving CE
minimization, could be used not only for estimating probabilities of rare events but
for solving difficult combinatorial optimization and continuous multi-extremal
problems as well. This is done by translating the original “deterministic”
optimization problem into a related “stochastic” estimation problem and then
applying the rare-event simulation machinery similar to [144]. In a nutshell,
the CE method involves an iterative procedure where each iteration can be
broken down into two phases:

1. Generate a random data sample (trajectories, vectors, etc.) according to 
a specified mechanism. 

2. Update the parameters of the random mechanism based on the data to 
produce a “better” sample in the next iteration. 
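The two phases above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the passage: the random mechanism is taken to be a one-dimensional normal distribution, "better" is taken to mean a lower value of a hypothetical `score` function, and the update refits the mean and standard deviation to the best-scoring ("elite") samples. The names `ce_minimize`, `n_elite` and the test function are invented for the example.

```python
import random
import statistics


def ce_minimize(score, mu=0.0, sigma=10.0, n_samples=100, n_elite=10,
                n_iter=50, seed=0):
    """Two-phase iteration: sample, then update the sampling parameters.

    Assumption: the sampling mechanism is N(mu, sigma^2) and the update
    refits (mu, sigma) to the n_elite samples with the lowest score.
    """
    rng = random.Random(seed)
    for _ in range(n_iter):
        # Phase 1: generate a random data sample from the current mechanism.
        xs = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # Phase 2: update the parameters using the best-performing samples,
        # so that the next iteration tends to produce a "better" sample.
        xs.sort(key=score)
        elite = xs[:n_elite]
        mu = statistics.fmean(elite)
        sigma = statistics.pstdev(elite)
    return mu


# Hypothetical test problem: minimize (x - 2)^2, whose minimizer is x = 2.
best = ce_minimize(lambda x: (x - 2.0) ** 2)
```

The update here is explicit (a sample mean and a sample standard deviation), which is the kind of simple updating rule the text refers to; other sampling families would lead to other formulas.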

The power and generality of this new method lie in the fact that the
updating rules are often simple and explicit (and hence fast), and are “optimal”
in some well-defined mathematical sense. Moreover, the CE method provides
a unifying approach to simulation and optimization and has great potential
for opening up new frontiers in these areas.

An increasing number of applications is being found for the CE method.
Recent publications on applications of the CE method include: buffer
allocation [9]; static simulation models [85]; queueing models of telecommunication
systems [43, 46]; neural computation [50]; control and navigation [83]; DNA
sequence alignment [98]; signal processing [110]; scheduling [113]; vehicle routing
[36]; reinforcement learning [117]; project management [37]; heavy-tail
distributions [15, 103]; CE convergence [114]; network reliability [87]; repairable
systems [137]; and max-cut and bipartition problems [147].

It is important to note that the CE method can deal successfully with
both deterministic and noisy problems. In the latter case it is assumed that
the objective function is “corrupted” with an additive “noise.” Such a situation
typically occurs in simulation-based problems, for example while solving the
buffer allocation problem [9].

The CE home page, featuring many links, articles, references, tutorials and 
computer programs on CE, can be found at 

http://www.cemethod.org



1.2 Random Experiments and Probability Distributions 

This section will review some important concepts in probability theory. For 
more details we refer to standard texts such as [55] and [73]. 

Probability theory is about modeling and analyzing random experiments:
experiments whose outcome cannot be determined in advance, but which are
nevertheless still subject to analysis. Mathematically, we can model a random
experiment by specifying an appropriate probability space $(\Omega, \mathcal{F}, \mathbb{P})$, where $\Omega$
is the set of all possible outcomes of the experiment (the sample space), $\mathcal{F}$ the
collection of all events, and $\mathbb{P}$ the probability (measure). However, in practice
the probability space only plays a “back-stage” role; often the most
convenient way to describe random experiments is by introducing random
variables, which are usually represented by capital letters such as $X$, $Y$, and $Z$.
One could think of these as the measurements that will be obtained if we
carry out our experiment tomorrow. A vector $\mathbf{X} = (X_1, \ldots, X_n)$ of random
variables is often called a random vector. A collection $\{X_t, t \in \mathcal{T}\}$ of random
variables indexed with some index set $\mathcal{T}$ is called a stochastic process.

In this book we have kept the measure theoretic foundations of
probability to a minimum. The only notable “concession” is in the definition of
probability densities. In particular, it is customary in elementary probability
to make the distinction between discrete and continuous random variables,
whose distributions are, in turn, specified by the probability mass function
(pmf) and the probability density function (pdf) of $X$. However, this creates
a great deal of redundancy: every idea and definition is repeated for both
the discrete and continuous case. A more elegant and unifying approach is to
define the probability distribution of a random variable $X$ as the measure $\nu$
defined by

$$\nu(B) = \mathbb{P}(X \in B), \quad B \in \mathcal{B},$$

where $\mathcal{B}$ is the collection of Borel sets in $\mathbb{R}$ (containing, for example, countable
unions of intervals). If $X$ is a discrete random variable with pmf $f$, then

$$\nu(B) = \sum_{x \in B} f(x),$$

and if $X$ is a continuous random variable with pdf $f$, then

$$\nu(B) = \int_B f(x)\, dx .$$

In both cases $f$ is the density of $\nu$ with respect to some base measure $\mu$, that
is

$$\nu(B) = \int_B f(x)\, \mu(dx), \quad B \in \mathcal{B} .$$

In the discrete case $f$ is the density of $\nu$ with respect to the counting measure,
in the continuous case $f$ is the density of $\nu$ with respect to the Lebesgue
measure on $\mathbb{R}$. From now on we will call $f$, in both the discrete and continuous
case, the pdf or density of $X$. The function $F$ defined by

$$F(x) = \int_{-\infty}^{x} f(u)\, \mu(du) = \mathbb{P}(X \le x)$$

is the well-known (cumulative) distribution function (cdf) of $X$. The
expectation of $g(X)$ for some (measurable) function $g$ can now be written as

$$\mathbb{E}\, g(X) = \int g(x)\, \nu(dx) = \int g(x) f(x)\, \mu(dx) = \int g(x)\, dF(x) .$$

Tables 1.1 and 1.2 list a number of important continuous and discrete
distributions, to which we will refer throughout this book.
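To make the unified notation concrete, the following sketch evaluates the expectation of $g(X)$ once as a sum (base measure: counting measure) and once as an integral (base measure: Lebesgue measure). The helper names, the choice of a Poisson and an exponential example with parameter $a = 2$, the truncation points and the midpoint rule are all assumptions made for illustration.

```python
import math


def expect_discrete(g, pmf, support):
    # Integral of g * f with respect to the counting measure: a plain sum.
    return sum(g(x) * pmf(x) for x in support)


def expect_continuous(g, pdf, lo, hi, n=200_000):
    # Integral of g * f with respect to the Lebesgue measure,
    # approximated with the midpoint rule on [lo, hi].
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += g(x) * pdf(x)
    return total * h


a = 2.0
poisson_pmf = lambda x: math.exp(-a) * a ** x / math.factorial(x)
exp_pdf = lambda x: a * math.exp(-a * x)

# Mean of Po(a): the support is truncated at 100 terms (negligible tail).
mean_poisson = expect_discrete(lambda x: x, poisson_pmf, range(100))
# Mean of Exp(a): the integral is truncated at 50 (negligible tail).
mean_exp = expect_continuous(lambda x: x, exp_pdf, 0.0, 50.0)
```

Both calls compute the same object, $\int g(x) f(x)\,\mu(dx)$; only the base measure $\mu$ differs.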




We will use the notation $X \sim f$, $X \sim F$ or $X \sim \text{Dist}$ to signify that $X$ has
a pdf $f$, cdf $F$ or a distribution Dist. Note that in Table 1.1, $\Gamma$ is the gamma
function:

$$\Gamma(\alpha) = \int_0^\infty e^{-x} x^{\alpha - 1}\, dx , \quad \alpha > 0 .$$



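Table 1.1. Commonly used continuous distributions.

Name                    Notation       $f(x)$                                                           $x \in$           Params.

Uniform                 U[a, b]        $\frac{1}{b-a}$                                                  $[a, b]$          $a < b$
Normal                  N(a, b^2)      $\frac{1}{b\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-a}{b}\right)^2}$   $\mathbb{R}$      $b > 0$
Gamma                   Gam(a, b)      $\frac{b^a x^{a-1} e^{-bx}}{\Gamma(a)}$                          $\mathbb{R}_+$    $a, b > 0$
Exponential             Exp(a)         $a\, e^{-ax}$                                                    $\mathbb{R}_+$    $a > 0$
Double exponential      DExp(a, b)     $\frac{b}{2}\, e^{-b|x-a|}$                                      $\mathbb{R}$      $b > 0$
Shifted exponential     SExp(a, b)     $b\, e^{-b(x-a)}$                                                $[a, \infty)$     $b > 0$
Truncated exponential   TExp(a, b)     $\frac{a\, e^{-ax}}{1 - e^{-ab}}$                                $[0, b]$          $a, b > 0$
Beta                    Beta(a, b)     $\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, x^{a-1}(1-x)^{b-1}$    $[0, 1]$          $a, b > 0$
Weibull                 Weib(a, b)     $ab\,(bx)^{a-1} e^{-(bx)^a}$                                     $\mathbb{R}_+$    $a, b > 0$
Pareto                  Pareto(a, b)   $ab\,(1 + bx)^{-(a+1)}$                                          $\mathbb{R}_+$    $a, b > 0$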



Table 1.2. Commonly used discrete distributions.

Name               Notation           $f(x)$                            $x \in$                  Params.

Bernoulli          Ber(p)             $p^x (1-p)^{1-x}$                 $\{0, 1\}$               $0 \le p \le 1$
Binomial           Bin(n, p)          $\binom{n}{x} p^x (1-p)^{n-x}$    $\{0, 1, \ldots, n\}$    $0 \le p \le 1$, $n \in \mathbb{N}$
Discrete uniform   DU{1, ..., n}      $\frac{1}{n}$                     $\{1, \ldots, n\}$       $n \in \{1, 2, \ldots\}$
Geometric          G(p)               $p\,(1-p)^{x-1}$                  $\{1, 2, \ldots\}$       $0 \le p \le 1$
Poisson            Po(a)              $e^{-a}\, \frac{a^x}{x!}$         $\{0, 1, \ldots\}$       $a > 0$



Light- and Heavy-Tail Distributions

In Chapter 3 we will deal extensively with heavy- and light-tail distributions
in the context of rare-event simulation. A random variable $X$ with distribution
function $F$ is said to have a light-tail distribution if the moment generating
function $\mathbb{E}\, e^{sX}$ is finite for some $s > 0$. That is,

$$\mathbb{E}\, e^{sX} \le c < \infty . \tag{1.1}$$

Since for every $x$

$$\mathbb{E}\, e^{sX} \ge \mathbb{E}\, e^{sX} I_{\{X > x\}} \ge e^{sx}\, \mathbb{P}(X > x) ,$$

it follows that for any $s > 0$ and $c$ satisfying (1.1) the following inequality
holds:

$$\mathbb{P}(X > x) \le c\, e^{-sx} .$$

In other words, if $X$ has a light tail, then $\bar{F}(x) = 1 - F(x)$ decays at an
exponential rate or faster. Examples of light-tailed distributions are the
exponential, normal and Weibull distribution (the latter with increasing failure
rate, that is $\bar{F}(x) = e^{-x^a}$ with $a \ge 1$), and the geometric, Poisson and any
distribution with bounded support.
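As a quick numerical check of the light-tail bound $\mathbb{P}(X > x) \le c\, e^{-sx}$, the sketch below uses the Exp(a) distribution, for which the moment generating function is $a/(a - s)$ for $s < a$ and the tail is $e^{-ax}$. The function names and the particular values $a = 1$, $s = 0.5$ are chosen for illustration only.

```python
import math


def exp_mgf(s, a):
    # E exp(sX) for X ~ Exp(a); finite only for s < a.
    if s >= a:
        raise ValueError("moment generating function is infinite for s >= a")
    return a / (a - s)


def exp_tail(x, a):
    # P(X > x) for X ~ Exp(a), x >= 0.
    return math.exp(-a * x)


a, s = 1.0, 0.5
c = exp_mgf(s, a)  # here the smallest constant satisfying (1.1) for this s

# The tail probability never exceeds c * exp(-s * x).
bound_holds = all(
    exp_tail(x, a) <= c * math.exp(-s * x)
    for x in (0.0, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0)
)
```

For this choice the bound reads $e^{-x} \le 2\, e^{-x/2}$, which holds for every $x \ge 0$.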

When $\mathbb{E}\, e^{sX} = \infty$ for all $s > 0$, $X$ is said to have a heavy-tail
distribution. Examples of heavy-tail distributions are the Weibull distribution with
decreasing failure rate ($\bar{F}(x) = e^{-x^a}$, $a < 1$), the log-normal ($X = e^Y$, with
$Y \sim \mathsf{N}(a, b^2)$) and any regularly varying distribution: $\bar{F}(x) = L(x)/x^{\alpha}$, where
$L(tx)/L(x) \to 1$ as $x \to \infty$ for all $t > 0$, such as the Pareto distribution.

A particularly important class of heavy-tailed distributions is the class of
subexponential distributions. A distribution with cdf $F$ on $(0, \infty)$ is said to
be subexponential if, with $X_1, X_2, \ldots, X_n$ a random sample from $F$, we have

$$\lim_{x \to \infty} \frac{\mathbb{P}(X_1 + \cdots + X_n > x)}{\mathbb{P}(X_1 > x)} = n , \tag{1.2}$$

for all $n$. Examples are the Pareto and log-normal distributions and the
Weibull with decreasing failure rate. See [54] for additional properties of this
class of distributions.



Unbounded and Finite Support Distributions 

In this book we classify distributions into two categories: the ones with
unbounded support, like the exponential, Poisson and normal distributions, and
the ones with bounded support, such as the uniform U(a, b), truncated exponential
and discrete n-point distributions. More formally, we say that a random
variable $X$ has bounded support if $\mathbb{P}(|X| > a) = 0$ for some $a$ large enough. Note
that bounded support distributions can be viewed as zero-tail distributions
as compared to their counterparts with infinite tail, which belong to the
category of either light- or heavy-tail distributions; see above and [148]. Notice
also that, in particular, finite support distributions (only a finite number of
values can be attained) have zero tail.



1.3 Exponential Families 

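Exponential families play an important role in statistics. Let $\mathbf{X}$ be a random
variable or vector (in this section vectors will always be interpreted as column
vectors) with pdf $f(\mathbf{x}; \boldsymbol{\theta})$ (with respect to some base measure $\mu$), where $\boldsymbol{\theta} =
(\theta_1, \ldots, \theta_m)^T$ is an $m$-dimensional parameter vector. $\mathbf{X}$ is said to belong to
an $m$-parameter exponential family if there exist real-valued functions $t_i(\mathbf{x})$
and $h(\mathbf{x}) > 0$ and a (normalizing) function $c(\boldsymbol{\theta}) > 0$, such that

$$f(\mathbf{x}; \boldsymbol{\theta}) = c(\boldsymbol{\theta})\, e^{\boldsymbol{\theta} \cdot \mathbf{t}(\mathbf{x})}\, h(\mathbf{x}) , \tag{1.3}$$

where $\mathbf{t}(\mathbf{x}) = (t_1(\mathbf{x}), \ldots, t_m(\mathbf{x}))^T$ and $\boldsymbol{\theta} \cdot \mathbf{t}(\mathbf{x})$ is the inner product $\sum_{i=1}^m \theta_i t_i(\mathbf{x})$.
The representation of an exponential family is in general not unique.

Table 1.3 displays the functions $c(\boldsymbol{\theta})$, $t_k(x)$ and $h(x)$ for several commonly
used distributions (a dash means that the corresponding value is not used).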



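Table 1.3. The functions $c(\boldsymbol{\theta})$, $t_k(x)$ and $h(x)$ for commonly used distributions.

Distr.        $\mathbb{E}X$                 $t_1(x),\ t_2(x)$    $c(\boldsymbol{\theta})$                      $\theta_1,\ \theta_2$                     $h(x)$

Gam(a, b)     $\frac{a}{b}$                 $x,\ \ln x$          $\frac{b^a}{\Gamma(a)}$                       $-b,\ a - 1$                              $1$
N(a, b^2)     $a$                           $x,\ x^2$            $\frac{e^{-a^2/(2b^2)}}{b\sqrt{2\pi}}$        $\frac{a}{b^2},\ -\frac{1}{2b^2}$         $1$
Weib(a, b)    $\frac{\Gamma(1/a)}{ab}$      $\ln x,\ -x^a$       $a\, b^a$                                     $a - 1,\ b^a$                             $1$
Bin(n, p)     $np$                          $x,\ -$              $(1-p)^n$                                     $\ln\left(\frac{p}{1-p}\right),\ -$       $\binom{n}{x}$
Po(a)         $a$                           $x,\ -$              $e^{-a}$                                      $\ln a,\ -$                               $\frac{1}{x!}$
G(p)          $\frac{1}{p}$                 $x - 1,\ -$          $p$                                           $\ln(1-p),\ -$                            $1$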
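The representation (1.3) can be checked numerically for a single case. The sketch below takes the Gam(a, b) distribution with $t(x) = (x, \ln x)$, $c(\boldsymbol{\theta}) = b^a/\Gamma(a)$, $\boldsymbol{\theta} = (-b, a-1)$ and $h(x) = 1$, and compares $c(\boldsymbol{\theta})\, e^{\boldsymbol{\theta} \cdot \mathbf{t}(x)} h(x)$ with the gamma pdf $b^a x^{a-1} e^{-bx}/\Gamma(a)$. The function names and the parameter values are illustrative assumptions.

```python
import math


def gamma_pdf(x, a, b):
    # Gam(a, b) density with shape a and rate b.
    return b ** a * x ** (a - 1) * math.exp(-b * x) / math.gamma(a)


def exp_family_pdf(x, c, theta, t, h):
    # c(theta) * exp(theta . t(x)) * h(x), as in (1.3).
    inner = sum(th * ti(x) for th, ti in zip(theta, t))
    return c * math.exp(inner) * h(x)


a, b = 2.5, 1.5
c = b ** a / math.gamma(a)          # normalizing function c(theta)
theta = (-b, a - 1)                 # (theta_1, theta_2)
t = (lambda x: x, math.log)         # (t_1(x), t_2(x)) = (x, ln x)
h = lambda x: 1.0

# Largest discrepancy between the two forms over a few test points.
max_gap = max(
    abs(gamma_pdf(x, a, b) - exp_family_pdf(x, c, theta, t, h))
    for x in (0.1, 1.0, 3.0, 7.5)
)
```

The same check works for the other rows of the table by swapping in the corresponding $c$, $\boldsymbol{\theta}$, $\mathbf{t}$ and $h$.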



Consider in particular the 1-dimensional case where (dropping the
boldface font) $X$ is a random variable, $m = 1$ and $t(x) = x$. The one-parameter
family of distributions with densities $\{f(x; \theta),\ \theta \in \Theta \subset \mathbb{R}\}$ given by

$$f(x; \theta) = c(\theta)\, e^{\theta x}\, h(x) \tag{1.4}$$

is called a natural exponential family (NEF) [121].

If $h(x)$ is a pdf itself, then $c^{-1}(\theta)$ is the corresponding moment generating
function:

$$c^{-1}(\theta) = \int e^{\theta x}\, h(x)\, \mu(dx) .$$

It is sometimes convenient to introduce instead the logarithm of the moment
generating function:

$$\zeta(\theta) = \ln \int e^{\theta x}\, h(x)\, \mu(dx) ,$$

which is called the cumulant function. We can now write (1.4) in the following
convenient form

$$f(x; \theta) = e^{\theta x - \zeta(\theta)}\, h(x) . \tag{1.5}$$

Example 1.1. If we take $h$ as the density of the $\mathsf{N}(0, \sigma^2)$-distribution, $\theta = \lambda/\sigma^2$
and $\zeta(\theta) = \sigma^2 \theta^2 / 2$, then the class $\{f(\cdot\,; \theta),\ \theta \in \mathbb{R}\}$ is the class of $\mathsf{N}(\lambda, \sigma^2)$
densities, where $\sigma^2$ is fixed and $\lambda \in \mathbb{R}$.

Similarly, if we take $h$ as the density of the Gam(a, 1)-distribution, and let
$\theta = 1 - \lambda$ and $\zeta(\theta) = -a \ln(1 - \theta) = -a \ln \lambda$, we obtain the class of Gam(a, $\lambda$)
distributions, with $a$ fixed and $\lambda > 0$. Note that in this case $\Theta = (-\infty, 1)$.

There are many NEFs. In fact, starting from any pdf $f_0$ we can easily
generate a NEF $\{f_\theta,\ \theta \in \Theta\}$ in the following way: Let $\Theta$ be the largest interval
for which the cumulant function $\zeta$ of $f_0$ exists. This includes $\theta = 0$, since $f_0$
is a pdf. Now define

$$f_\theta(x) = e^{\theta x - \zeta(\theta)}\, f_0(x) . \tag{1.6}$$

Then $\{f_\theta,\ \theta \in \Theta\}$ thus defined, is a NEF. We say that $f_\theta$ is obtained from $f_0$
by an exponential twist with twisting parameter $\theta$. Exponential twisting plays
an important role in rare-event simulation as we shall see in Chapter 3.

In some cases it is useful to reparameterize a NEF. In particular, suppose
that a random variable $X$ has a distribution in some NEF $\{f_\theta\}$. It is not
difficult to see that

$$\mathbb{E}_\theta X = \zeta'(\theta) \quad \text{and} \quad \mathrm{Var}_\theta(X) = \zeta''(\theta) . \tag{1.7}$$

Since the derivative $\zeta''(\theta) = \mathrm{Var}_\theta(X)$ of $\zeta'(\theta)$ is always greater than 0,
$\zeta'(\theta)$ is increasing in $\theta$, and thus we can reparameterize the family using the mean $v = \mathbb{E}_\theta X$.
In particular, to the above NEF there is a corresponding family $\{g_v\}$ such
that for each pair $(\theta, v)$ satisfying $\zeta'(\theta) = v$ we have $g_v = f_\theta$.

Example 1.2. Consider the second case in Example 1.1. Note that we constructed in fact a NEF $\{f_\theta,\ \theta \in (-\infty,1)\}$ by exponentially twisting the $\mathsf{Gam}(a,1)$ distribution, with density $f_0(x) = x^{a-1}e^{-x}/\Gamma(a)$. We have $\zeta'(\theta) = a/(1-\theta) = v$. This leads to the reparameterized density

$$g_v(x) = \exp\bigl(\theta x + a\ln(1-\theta)\bigr)\, f_0(x) = \frac{\exp\left(-\frac{a}{v}\,x\right) \left(\frac{a}{v}\right)^{a} x^{a-1}}{\Gamma(a)} ,$$

corresponding to the $\mathsf{Gam}(a, a v^{-1})$ distribution, $v > 0$.
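The reparameterization in Example 1.2 can be checked numerically. The following is a minimal sketch, assuming the rate parameterization of the Gamma density used above; the function names and the values of `a` and `v` are illustrative choices, not part of the text.

```python
import math

def gamma_pdf(x, a, rate):
    """Density of Gam(a, rate): rate^a x^(a-1) exp(-rate x) / Gamma(a)."""
    return math.exp(a * math.log(rate) + (a - 1) * math.log(x) - rate * x - math.lgamma(a))

def twisted_pdf(x, a, theta):
    """Exponential twist of the Gam(a, 1) density: exp(theta x - zeta(theta)) f0(x)."""
    zeta = -a * math.log(1.0 - theta)   # cumulant function of Gam(a, 1), valid for theta < 1
    return math.exp(theta * x - zeta) * gamma_pdf(x, a, 1.0)

a, v = 2.5, 4.0            # illustrative shape and target mean
theta = 1.0 - a / v        # solves zeta'(theta) = a / (1 - theta) = v
points = (0.5, 1.0, 3.0, 10.0)
twisted = [twisted_pdf(x, a, theta) for x in points]
direct = [gamma_pdf(x, a, a / v) for x in points]   # Gam(a, a/v) density at the same points
```

Under these assumptions the two lists agree up to floating-point rounding, which mirrors the identity $g_v = f_\theta$ for $\zeta'(\theta) = v$.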



1.4 Efficiency of Estimators 



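In this book we will frequently use

$$\hat{\ell} = \frac{1}{N} \sum_{i=1}^{N} Z_i , \tag{1.8}$$

which presents an unbiased estimator of the unknown quantity $\ell = \mathbb{E}\hat{\ell} = \mathbb{E}Z$, where $Z_1, \ldots, Z_N$ are independent replications of some random variable $Z$.

By the central limit theorem, $\hat{\ell}$ has approximately a $\mathsf{N}(\ell, N^{-1}\mathrm{Var}(Z))$ distribution, for large $N$. We shall estimate $\mathrm{Var}(Z)$ via the sample variance

$$S^2 = \frac{1}{N-1} \sum_{i=1}^{N} (Z_i - \hat{\ell})^2 .$$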



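By the law of large numbers, $S^2$ converges with probability 1 to $\mathrm{Var}(Z)$. Consequently, for $\mathrm{Var}(Z) < \infty$ and large $N$, the approximate $(1-\alpha)$ confidence interval for $\ell$ is given by

$$\left( \hat{\ell} - z_{1-\alpha/2}\, \frac{S}{\sqrt{N}},\ \ \hat{\ell} + z_{1-\alpha/2}\, \frac{S}{\sqrt{N}} \right) ,$$

where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of the standard normal distribution. For example, for $\alpha = 0.05$ we have $z_{1-\alpha/2} = z_{0.975} = 1.96$. The quantity

$$\frac{S}{\hat{\ell}\, \sqrt{N}}$$

is often used in the simulation literature as an accuracy measure for the estimator $\hat{\ell}$. For large $N$ it converges to the relative error (RE) of $\hat{\ell}$, defined as:

$$\kappa = \frac{\sqrt{\mathrm{Var}(\hat{\ell})}}{\mathbb{E}\hat{\ell}} = \frac{\sqrt{\mathrm{Var}(Z)/N}}{\ell} .$$

The square of the relative error

$$\kappa^2 = \frac{\mathrm{Var}(\hat{\ell})}{\ell^2} \tag{1.10}$$

is called the squared coefficient of variation (SCV).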
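As a small illustration of the quantities in this section, the sketch below computes the sample mean (1.8), the sample variance, the approximate confidence interval and the estimated relative error from a list of replications. The helper name `mc_summary` and the choice $Z \sim \mathsf{Exp}(1)$ are assumptions made only for this example.

```python
import math
import random

def mc_summary(z, z_quantile=1.96):
    """Return (sample mean, sample variance, approximate CI, estimated relative error)."""
    n = len(z)
    ell_hat = sum(z) / n                                   # estimator (1.8)
    s2 = sum((zi - ell_hat) ** 2 for zi in z) / (n - 1)    # sample variance S^2
    half_width = z_quantile * math.sqrt(s2 / n)            # z_{1-alpha/2} * S / sqrt(N)
    rel_err = math.sqrt(s2 / n) / ell_hat                  # S / (ell_hat * sqrt(N))
    return ell_hat, s2, (ell_hat - half_width, ell_hat + half_width), rel_err

rng = random.Random(1)
sample = [rng.expovariate(1.0) for _ in range(10_000)]     # Z ~ Exp(1), so EZ = 1
ell_hat, s2, ci, rel_err = mc_summary(sample)
```

The default `z_quantile=1.96` corresponds to $\alpha = 0.05$; other confidence levels require the matching normal quantile.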



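Example 1.3 (Estimation of Rare-Event Probabilities). Consider estimation of the tail probability $\ell = \mathbb{P}(X \ge \gamma)$ of some random variable $X$, for a large number $\gamma$. If $\ell$ is very small, then the event $\{X \ge \gamma\}$ is called a rare event and the probability $\mathbb{P}(X \ge \gamma)$ is called the rare-event probability.

We may attempt to estimate $\ell$ via (1.8) as

$$\hat{\ell} = \frac{1}{N} \sum_{i=1}^{N} I_{\{X_i \ge \gamma\}} , \tag{1.11}$$

which involves drawing a random sample $X_1, \ldots, X_N$ from the pdf of $X$ and defining the indicators $Z_i = I_{\{X_i \ge \gamma\}}$, $i = 1, \ldots, N$. The estimator $\hat{\ell}$ thus defined is called the crude Monte Carlo (CMC) estimator. For small $\ell$ the RE of the CMC estimator is given by

$$\kappa = \frac{\sqrt{\mathrm{Var}(\hat{\ell})}}{\mathbb{E}\hat{\ell}} = \sqrt{\frac{1-\ell}{N\ell}} \approx \sqrt{\frac{1}{N\ell}} . \tag{1.12}$$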



As a numerical example, suppose $\ell = 10^{-6}$. In order to estimate $\ell$ accurately with relative error (say) $\kappa = 0.01$ we need to choose a sample size

$$N \approx \frac{1}{\kappa^2\, \ell} = 10^{10} .$$

This shows that estimating small probabilities via CMC is computationally meaningless.
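The sample-size requirement follows by solving (1.12) for $N$. A minimal sketch of that arithmetic is given below; the function name `cmc_sample_size` and the values $\ell = 10^{-6}$, $\kappa = 0.01$ are illustrative assumptions.

```python
import math

def cmc_sample_size(ell, kappa):
    """Smallest N with sqrt((1 - ell) / (N * ell)) <= kappa, obtained from (1.12)."""
    return math.ceil((1.0 - ell) / (kappa ** 2 * ell))

n_required = cmc_sample_size(1e-6, 0.01)   # on the order of 1e10 replications
```

Halving the target relative error quadruples the required sample size, while each tenfold decrease in $\ell$ increases it roughly tenfold.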



Complexity 

The theoretical framework in which one typically examines rare-event probability estimation is based on complexity theory, as introduced in [18, 102]. In particular, the estimators are classified either as polynomial-time or as exponential-time. It is shown in [18, 148] that for an arbitrary estimator $\hat{\ell}$ of $\ell$ to be polynomial-time as a function of some $\gamma$, it suffices that its squared coefficient of variation $\kappa^2$, or its relative error $\kappa$, be bounded in $\gamma$ by some polynomial function $p(\gamma)$. For such polynomial-time estimators the required sample size to achieve a fixed relative error does not grow too fast as the event becomes rarer.

Consider the estimator (1.11) and assume that $\ell$ becomes very small as $\gamma \to \infty$. Note that

$$\mathbb{E}Z^2 \ge (\mathbb{E}Z)^2 = \ell^2 .$$

Hence, the best one can hope for from such an estimator is that its second moment $\mathbb{E}Z^2$ decreases proportionally to $\ell^2$ as $\gamma \to \infty$. We say that the rare-event estimator (1.11) has bounded relative error if for all $\gamma$

$$\mathbb{E}Z^2 \le c\, \ell^2 \tag{1.13}$$

for some fixed $c \ge 1$. Because bounded relative error is not always easy to achieve, the following weaker criterion is often used. We say that the estimator (1.11) is logarithmically efficient (sometimes called "asymptotically optimal") if

$$\lim_{\gamma \to \infty} \frac{\ln \mathbb{E}Z^2}{\ln \ell^2} = 1 . \tag{1.14}$$



Example 1.4 (The CMC Estimator is not Logarithmically Efficient). Consider the CMC estimator (1.11). We have

$$\mathbb{E}Z^2 = \mathbb{E}Z = \ell ,$$

so that

$$\lim_{\gamma \to \infty} \frac{\ln \mathbb{E}Z^2}{\ln \ell^2(\gamma)} = \frac{\ln \ell}{\ln \ell^2} = \frac{1}{2} .$$

Hence, the CMC estimator is not logarithmically efficient and therefore alternative estimators must be found to estimate small $\ell$.



1.5 Information 

In this section we discuss briefly various measures of information in a random experiment. Suppose we describe the measurements on a random experiment via a random vector $\mathbf{X} = (X_1, \ldots, X_n)$ with pdf $f$. Then, all the information about the experiment (all our probabilistic knowledge) is obviously contained in the pdf $f$. However, in most cases we wish to characterize our information about the experiments with just a few key numbers. Well-known examples are the expectation and the covariance matrix of $\mathbf{X}$, which provide information about the mean measurements and the variability of the measurements, respectively. Another informational measure comes from coding and communications theory, where the Shannon entropy characterizes the average number of bits needed to transmit a message $X$ over a (binary) communication channel. Yet another approach to information can be found in statistics. Specifically, in the theory of point estimation the pdf $f$ depends on a parameter vector $\boldsymbol{\theta}$. The question is how well $\boldsymbol{\theta}$ can be estimated via an outcome of $\mathbf{X}$, in other words, how much information about $\boldsymbol{\theta}$ is contained in the "data" $\mathbf{X}$. Various measures for this type of information are associated with the maximum likelihood, the score and the (Fisher) information matrix. Finally, the amount of information in a random experiment can often be quantified via a "distance" concept; the most notable for this book is the Kullback-Leibler "distance" (divergence), also called the cross-entropy.







Shannon Entropy 

One of the most celebrated measures of uncertainty in information theory is the Shannon entropy, or simply entropy. A good reference is [39], where the entropy of a discrete random variable $X$ with density $f$ is defined as

$$\mathcal{H}(X) = -\mathbb{E} \log_2 f(X) = -\sum_{x \in \mathcal{X}} f(x) \log_2 f(x) .$$

Here $X$ is interpreted as a random character from an alphabet $\mathcal{X}$, such that $X = x$ with probability $f(x)$. We will use the convention $0 \ln 0 = 0$.

It can be shown that the most efficient way to transmit characters sampled from $f$ over a binary channel is to encode them such that the number of bits required to transmit $x$ is equal to $\log_2(1/f(x))$. It follows that $-\sum_{x \in \mathcal{X}} f(x) \log_2 f(x)$ is the expected bit length required to send a random character $X \sim f$; see [39].

A more general approach, which includes continuous random variables, is to define the entropy of a random variable $X$ with density $f$ (with respect to some measure $\mu$) by

$$\mathcal{H}(X) = -\mathbb{E} \ln f(X) = -\int f(x) \ln f(x)\, \mu(dx) . \tag{1.15}$$

Definition (1.15) can easily be extended to random vectors $\mathbf{X}$ as

$$\mathcal{H}(\mathbf{X}) = -\mathbb{E} \ln f(\mathbf{X}) = -\int f(\mathbf{x}) \ln f(\mathbf{x})\, \mu(d\mathbf{x}) . \tag{1.16}$$

Often $\mathcal{H}(\mathbf{X})$ is called the joint entropy of the random variables $X_1, \ldots, X_n$, and is also written as $\mathcal{H}(X_1, \ldots, X_n)$. When $\mu$ is the Lebesgue measure we have

$$\mathcal{H}(\mathbf{X}) = -\int f(\mathbf{x}) \ln f(\mathbf{x})\, d\mathbf{x} ,$$

which is frequently referred to as the differential entropy, to distinguish it from the discrete case.

Example 1.5. Let $X$ have a $\mathsf{Ber}(p)$ distribution, for some $0 \le p \le 1$. The density $f$ of $X$ is given by $f(1) = \mathbb{P}(X = 1) = p$ and $f(0) = \mathbb{P}(X = 0) = 1 - p$, so that the entropy of $X$ is

$$\mathcal{H}(X) = -p \ln p - (1-p) \ln(1-p) .$$

Note that the entropy is maximal for $p = 1/2$, which gives the "uniform" density on $\{0, 1\}$. Next, consider a sequence $X_1, \ldots, X_n$ of i.i.d. $\mathsf{Ber}(p)$ random variables. Let $\mathbf{X} = (X_1, \ldots, X_n)$. The density of $\mathbf{X}$, $g$ say, is simply the product of the densities of the $X_i$, so that

$$\mathcal{H}(\mathbf{X}) = -\mathbb{E} \ln g(\mathbf{X}) = -\mathbb{E} \ln \prod_{i=1}^{n} f(X_i) = \sum_{i=1}^{n} -\mathbb{E} \ln f(X_i) = n\, \mathcal{H}(X) .$$
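The additivity in Example 1.5 can be verified by brute force for small $n$. The sketch below works in nats (natural logarithms) and enumerates all $2^n$ outcomes; the function names are illustrative assumptions.

```python
import math
from itertools import product

def entropy(pmf):
    """Entropy in nats of a discrete pmf given as probabilities (convention 0 ln 0 = 0)."""
    return -sum(q * math.log(q) for q in pmf if q > 0)

def bernoulli_entropy(p):
    return entropy([p, 1.0 - p])

def iid_bernoulli_joint_entropy(p, n):
    """Entropy of (X_1, ..., X_n) for i.i.d. Ber(p), summing over all 2^n outcomes."""
    joint = [math.prod(p if xi else 1.0 - p for xi in x)
             for x in product((0, 1), repeat=n)]
    return entropy(joint)

h_single = bernoulli_entropy(0.3)
h_joint = iid_bernoulli_joint_entropy(0.3, 4)   # should equal 4 * h_single
```

The enumeration is exponential in $n$ and is meant only as a check of the identity, not as a practical way to compute joint entropies.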







The properties of $\mathcal{H}(\mathbf{X})$ in the continuous case are somewhat different from the discrete one. In particular,

1. The differential entropy can be negative whereas the discrete entropy is always nonnegative.

2. The discrete entropy is insensitive to invertible transformations, whereas the differential entropy is not. Specifically, if $X$ is discrete, $Y = g(X)$ and $g$ is an invertible mapping, then $\mathcal{H}(X) = \mathcal{H}(Y)$, because $f_Y(y) = f_X(g^{-1}(y))$. However, in the continuous case we have an additional factor due to the Jacobian of the transformation.



It is not difficult to see that of any density $f$, the one that gives the maximum entropy is the uniform density on $\mathcal{X}$. That is,

$$\mathcal{H}(\mathbf{X}) \ \text{is maximal} \iff f(\mathbf{x}) = \frac{1}{\mu(\mathcal{X})} \ \ \text{(constant)} . \tag{1.17}$$

This holds, of course, up to a set of $\mu$-measure 0.

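For two random vectors $\mathbf{X}$ and $\mathbf{Y}$ with joint pdf $f$ we define the conditional entropy of $\mathbf{Y}$ given $\mathbf{X}$ as

$$\mathcal{H}(\mathbf{Y} \mid \mathbf{X}) = -\mathbb{E} \ln \frac{f(\mathbf{X}, \mathbf{Y})}{f_{\mathbf{X}}(\mathbf{X})} = \mathcal{H}(\mathbf{X}, \mathbf{Y}) - \mathcal{H}(\mathbf{X}) , \tag{1.18}$$

where $f_{\mathbf{X}}$ is the pdf of $\mathbf{X}$ and $f(\mathbf{x}, \mathbf{y})/f_{\mathbf{X}}(\mathbf{x})$ is the conditional density of $\mathbf{Y}$ (at $\mathbf{y}$), given $\mathbf{X} = \mathbf{x}$. It follows that

$$\mathcal{H}(\mathbf{X}, \mathbf{Y}) = \mathcal{H}(\mathbf{X}) + \mathcal{H}(\mathbf{Y} \mid \mathbf{X}) = \mathcal{H}(\mathbf{Y}) + \mathcal{H}(\mathbf{X} \mid \mathbf{Y}) . \tag{1.19}$$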



It is reasonable to impose that any sensible additive measure describing 
the average amount of uncertainty should satisfy at least (1.19) and (1.17). It 
follows that the uniform density carries the least amount of information, and 
the entropy (average amount of uncertainty) of (X, Y) is equal to the sum of 
the entropy of X and the amount of entropy in Y after the information in X 
has been accounted for. 

It is argued in [99] that any concept of entropy that includes the general 
properties (1.17) and (1.19) must lead to the definition (1.16). 

The mutual information of $\mathbf{X}$ and $\mathbf{Y}$ is defined as

$$\mathcal{M}(\mathbf{X}, \mathbf{Y}) = \mathcal{H}(\mathbf{X}) + \mathcal{H}(\mathbf{Y}) - \mathcal{H}(\mathbf{X}, \mathbf{Y}) , \tag{1.20}$$

which, as the name suggests, can be interpreted as the amount of information shared by $\mathbf{X}$ and $\mathbf{Y}$. An alternative expression, which follows from (1.19) and (1.20), is

$$\mathcal{M}(\mathbf{X}, \mathbf{Y}) = \mathcal{H}(\mathbf{X}) - \mathcal{H}(\mathbf{X} \mid \mathbf{Y}) = \mathcal{H}(\mathbf{Y}) - \mathcal{H}(\mathbf{Y} \mid \mathbf{X}) , \tag{1.21}$$

which can be interpreted as the reduction of the uncertainty of one random variable due to the knowledge of the other. It is not difficult to show that the mutual information is always nonnegative. It is also related to the cross-entropy concept, which follows.







Kullback-Leibler Cross-Entropy 



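Let $g$ and $h$ be two densities with respect to the measure $\mu$ on $\mathcal{X}$. The Kullback-Leibler cross-entropy between $g$ and $h$ (compare with (1.16)) is defined as

$$\mathcal{D}(g, h) = \mathbb{E}_g \ln \frac{g(\mathbf{X})}{h(\mathbf{X})} = \int g(\mathbf{x}) \ln g(\mathbf{x})\, \mu(d\mathbf{x}) - \int g(\mathbf{x}) \ln h(\mathbf{x})\, \mu(d\mathbf{x}) . \tag{1.22}$$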



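$\mathcal{D}(g, h)$ is also called the Kullback-Leibler divergence, the cross-entropy and the relative entropy. If not stated otherwise, we shall call $\mathcal{D}(g, h)$ the cross-entropy (CE) between $g$ and $h$. Notice that $\mathcal{D}(g, h)$ is not a "distance" between $g$ and $h$ in the formal sense, since in general $\mathcal{D}(g, h) \ne \mathcal{D}(h, g)$. Nonetheless, it is often useful to think of $\mathcal{D}(g, h)$ as a distance because

$$\mathcal{D}(g, h) \ge 0$$

and $\mathcal{D}(g, h) = 0$ if and only if $g(\mathbf{x}) = h(\mathbf{x})$. This follows from Jensen's inequality (if $\phi$ is a convex function, such as $-\ln$, then $\mathbb{E}\,\phi(X) \ge \phi(\mathbb{E}X)$). Namely,

$$\mathcal{D}(g, h) = \mathbb{E}_g \left[ -\ln \frac{h(\mathbf{X})}{g(\mathbf{X})} \right] \ge -\ln \left\{ \mathbb{E}_g \left[ \frac{h(\mathbf{X})}{g(\mathbf{X})} \right] \right\} = -\ln 1 = 0 .$$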
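For discrete densities the CE in (1.22) reduces to a finite sum, which makes its nonnegativity and its lack of symmetry easy to observe numerically. The following sketch assumes two pmfs on a common three-point set; the function name and the particular values of `g` and `h` are illustrative.

```python
import math

def kl_divergence(g, h):
    """D(g, h) = sum_x g(x) ln(g(x) / h(x)) for pmfs on a common finite set."""
    total = 0.0
    for gx, hx in zip(g, h):
        if gx > 0:                 # convention 0 ln 0 = 0
            if hx == 0:
                return math.inf    # g puts mass where h has none
            total += gx * math.log(gx / hx)
    return total

g = [0.5, 0.3, 0.2]
h = [0.2, 0.2, 0.6]
d_gh = kl_divergence(g, h)   # D(g, h)
d_hg = kl_divergence(h, g)   # D(h, g), generally a different number
```

Both values are positive and they differ, which is why the text puts "distance" in quotation marks.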



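It can be readily seen that the mutual information $\mathcal{M}(\mathbf{X}, \mathbf{Y})$ of vectors $\mathbf{X}$ and $\mathbf{Y}$ defined in (1.20) is related to the CE in the following way:

$$\mathcal{M}(\mathbf{X}, \mathbf{Y}) = \mathcal{D}(f, f_{\mathbf{X}} f_{\mathbf{Y}}) = \mathbb{E}_f \ln \frac{f(\mathbf{X}, \mathbf{Y})}{f_{\mathbf{X}}(\mathbf{X})\, f_{\mathbf{Y}}(\mathbf{Y})} ,$$

where $f$ is the (joint) pdf of $(\mathbf{X}, \mathbf{Y})$, and $f_{\mathbf{X}}$ and $f_{\mathbf{Y}}$ are the (marginal) pdfs of $\mathbf{X}$ and $\mathbf{Y}$, respectively. In other words, the mutual information can be viewed as the CE that measures the "distance" between the joint pdf $f$ of $\mathbf{X}$ and $\mathbf{Y}$ and the product of their marginal pdfs $f_{\mathbf{X}}$ and $f_{\mathbf{Y}}$, that is, the joint pdf under the assumption that the vectors $\mathbf{X}$ and $\mathbf{Y}$ are independent.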
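The two expressions for the mutual information, (1.20) and the CE form, can be compared on a small discrete example. The sketch below assumes a joint pmf stored as a nested list with rows indexed by $x$ and columns by $y$; the table values and function names are illustrative.

```python
import math

def entropy(pmf):
    """Entropy in nats of a discrete pmf (convention 0 ln 0 = 0)."""
    return -sum(q * math.log(q) for q in pmf if q > 0)

def mutual_information(joint):
    """Return M(X, Y) computed via (1.20) and via the CE between f and f_X f_Y."""
    f_x = [sum(row) for row in joint]          # marginal pmf of X
    f_y = [sum(col) for col in zip(*joint)]    # marginal pmf of Y
    flat = [q for row in joint for q in row]
    via_entropies = entropy(f_x) + entropy(f_y) - entropy(flat)
    via_ce = sum(q * math.log(q / (f_x[i] * f_y[j]))
                 for i, row in enumerate(joint)
                 for j, q in enumerate(row) if q > 0)
    return via_entropies, via_ce

joint = [[0.4, 0.1],
         [0.2, 0.3]]
m_entropy, m_ce = mutual_information(joint)
```

For an independent pair, for example a joint pmf with all four entries equal to 0.25, both expressions evaluate to zero up to rounding.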

The CE distance is a particular case of the Ali-Silvey "distance" [8] between two probability densities $g$ and $h$, which is defined as

$$d(g, h) = \psi\left( \mathbb{E}_g\, \phi\!\left( \frac{g(\mathbf{X})}{h(\mathbf{X})} \right) \right) , \tag{1.23}$$

where $\psi(\cdot)$ is a continuous convex function on $(0, +\infty)$ and $\phi(\cdot)$ is an increasing, real-valued function of a real variable. For the CE distance we have $\phi = \ln$ and $\psi(y) = y$. In Chapter 3 we will consider also the variance minimization (VM) "distance," where $\phi(y) = y^2$ and $\psi(y) = y$.

Finally, we mention that the CE is often used in Bayesian statistics to measure the "distance" between the prior and the posterior distributions. In particular, CE enables selection of the most "uninformative" prior distribution, which is crucial in Bayesian statistics and is associated with the maximum entropy principle [23]. It should be noted that in Bayesian inference the cross-entropy is usually written with the opposite sign, so that maximization of the Bayesian cross-entropy corresponds to minimization of the CE distance.

The MLE and the Score Function 

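We introduce here the notion of the score function via the classical maximum likelihood estimator (MLE). Consider a random vector $\mathbf{X} = (X_1, \ldots, X_n)$, which is distributed according to a fixed pdf $f(\cdot; \boldsymbol{\theta})$ with unknown parameter (vector) $\boldsymbol{\theta} \in \Theta$. Assume that we wish to estimate $\boldsymbol{\theta}$ on the basis of a given outcome $\mathbf{x}$ (the data) of $\mathbf{X}$. For a given $\mathbf{x}$, the function $\mathcal{L}(\boldsymbol{\theta}; \mathbf{x}) = f(\mathbf{x}; \boldsymbol{\theta})$ is called the likelihood function. Note that $\mathcal{L}$ is a function of $\boldsymbol{\theta}$ for a fixed parameter $\mathbf{x}$, whereas for the pdf $f$ it is the other way around. The maximum likelihood estimate (MLE) $\hat{\boldsymbol{\theta}} = \hat{\boldsymbol{\theta}}(\mathbf{x})$ of $\boldsymbol{\theta}$ is defined as

$$\hat{\boldsymbol{\theta}} = \mathop{\mathrm{argmax}}_{\boldsymbol{\theta} \in \Theta}\ \mathcal{L}(\boldsymbol{\theta}; \mathbf{x}) . \tag{1.24}$$

Because the function $\ln$ is monotone increasing we also have

$$\hat{\boldsymbol{\theta}} = \mathop{\mathrm{argmax}}_{\boldsymbol{\theta} \in \Theta}\ \ln \mathcal{L}(\boldsymbol{\theta}; \mathbf{x}) . \tag{1.25}$$

The random variable $\hat{\boldsymbol{\theta}}(\mathbf{X})$ with $\mathbf{X} \sim f(\cdot; \boldsymbol{\theta})$ is the corresponding maximum likelihood estimator, which is also abbreviated as MLE and again written as $\hat{\boldsymbol{\theta}}$. Note that often the data $X_1, \ldots, X_N$ form a random sample from some pdf $f_1(\cdot; \boldsymbol{\theta})$, in which case $f(\mathbf{x}; \boldsymbol{\theta}) = \prod_{i=1}^{N} f_1(x_i; \boldsymbol{\theta})$ and

$$\hat{\boldsymbol{\theta}} = \mathop{\mathrm{argmax}}_{\boldsymbol{\theta} \in \Theta}\ \sum_{i=1}^{N} \ln f_1(X_i; \boldsymbol{\theta}) . \tag{1.26}$$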

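If $\mathcal{L}(\boldsymbol{\theta}; \mathbf{x})$ is a continuously differentiable concave function with respect to $\boldsymbol{\theta}$ and the maximum is attained in the interior of $\Theta$, then we can find the MLE of $\boldsymbol{\theta}$ by solving

$$\nabla_{\boldsymbol{\theta}} \ln \mathcal{L}(\boldsymbol{\theta}; \mathbf{x}) = \mathbf{0} .$$

The function $\mathcal{S}(\cdot; \mathbf{x})$ defined by

$$\mathcal{S}(\boldsymbol{\theta}; \mathbf{x}) = \nabla_{\boldsymbol{\theta}} \ln \mathcal{L}(\boldsymbol{\theta}; \mathbf{x}) = \frac{\nabla_{\boldsymbol{\theta}} f(\mathbf{x}; \boldsymbol{\theta})}{f(\mathbf{x}; \boldsymbol{\theta})} \tag{1.27}$$

is called the score function. For the exponential family (1.3) it is easy to see that

$$\mathcal{S}(\boldsymbol{\theta}; \mathbf{x}) = \frac{\nabla c(\boldsymbol{\theta})}{c(\boldsymbol{\theta})} + \mathbf{t}(\mathbf{x}) . \tag{1.28}$$

The random vector $\mathcal{S}(\boldsymbol{\theta}) = \mathcal{S}(\boldsymbol{\theta}; \mathbf{X})$ with $\mathbf{X} \sim f(\cdot; \boldsymbol{\theta})$ is called the (efficient) score. The expected score is always equal to the zero vector, that is

$$\mathbb{E}_{\boldsymbol{\theta}}\, \mathcal{S}(\boldsymbol{\theta}) = \int \nabla_{\boldsymbol{\theta}} f(\mathbf{x}; \boldsymbol{\theta})\, \mu(d\mathbf{x}) = \nabla_{\boldsymbol{\theta}} \int f(\mathbf{x}; \boldsymbol{\theta})\, \mu(d\mathbf{x}) = \nabla_{\boldsymbol{\theta}} 1 = \mathbf{0} ,$$

where the interchange of differentiation and integration is justified via the bounded convergence theorem.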



Fisher Information 



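The covariance matrix $\mathcal{I}(\boldsymbol{\theta})$ of the score $\mathcal{S}(\boldsymbol{\theta})$ is called the Fisher information matrix. Since the expected score is always $\mathbf{0}$, we have

$$\mathcal{I}(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}}\, \mathcal{S}(\boldsymbol{\theta})\, \mathcal{S}(\boldsymbol{\theta})^T . \tag{1.29}$$

In the 1-dimensional case we thus have

$$\mathcal{I}(\theta) = \mathbb{E}_\theta \left( \frac{\partial \ln f(X; \theta)}{\partial \theta} \right)^2 .$$

Because

$$\frac{\partial^2}{\partial \theta^2} \ln f(x; \theta) = \frac{\frac{\partial^2}{\partial \theta^2} f(x; \theta)}{f(x; \theta)} - \left( \frac{\frac{\partial}{\partial \theta} f(x; \theta)}{f(x; \theta)} \right)^2 ,$$

we see that (under straightforward regularity conditions) the Fisher information is also given by

$$\mathcal{I}(\theta) = -\mathbb{E}_\theta\, \frac{\partial^2 \ln f(X; \theta)}{\partial \theta^2} .$$

In the multidimensional case we have similarly

$$\mathcal{I}(\boldsymbol{\theta}) = -\mathbb{E}_{\boldsymbol{\theta}}\, \nabla \mathcal{S}(\boldsymbol{\theta}) = -\mathbb{E}_{\boldsymbol{\theta}}\, \nabla^2 \ln f(\mathbf{X}; \boldsymbol{\theta}) , \tag{1.30}$$

where $\nabla^2 \ln f(\mathbf{X}; \boldsymbol{\theta})$ denotes the Hessian of $\ln f(\mathbf{X}; \boldsymbol{\theta})$, that is, the (random) matrix

$$\left( \frac{\partial^2 \ln f(\mathbf{X}; \boldsymbol{\theta})}{\partial \theta_i\, \partial \theta_j} \right) .$$

The importance of the Fisher information in statistics is corroborated by the famous Cramér-Rao inequality which (in a simplified form) states that the variance of any unbiased estimator $Z$ of $g(\boldsymbol{\theta})$ is bounded from below via

$$\mathrm{Var}(Z) \ge (\nabla g(\boldsymbol{\theta}))^T\, \mathcal{I}^{-1}(\boldsymbol{\theta})\, \nabla g(\boldsymbol{\theta}) . \tag{1.31}$$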



For more details see [107]. 
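As a concrete check of the zero-mean property of the score and of the two expressions for the Fisher information, consider the Poisson density $f(x;a) = a^x e^{-a}/x!$, whose score is $x/a - 1$ and whose second log-derivative is $-x/a^2$. The sketch below evaluates the three expectations by direct summation; the truncation point `x_max` and the function names are assumptions of this example.

```python
import math

def poisson_pmf(x, a):
    return math.exp(x * math.log(a) - a - math.lgamma(x + 1))

def poisson_score_moments(a, x_max=200):
    """Expected score, E[score^2] and -E[d^2 ln f / da^2] for Po(a), by truncated summation."""
    mean_score = second_moment = neg_mean_hessian = 0.0
    for x in range(x_max + 1):
        w = poisson_pmf(x, a)
        score = x / a - 1.0                 # d/da ln f(x; a)
        mean_score += w * score
        second_moment += w * score ** 2     # variance of the score, since its mean is 0
        neg_mean_hessian += w * x / a ** 2  # -(d^2/da^2) ln f(x; a) = x / a^2
    return mean_score, second_moment, neg_mean_hessian

mean_score, fisher_from_score, fisher_from_hessian = poisson_score_moments(3.0)
```

For $a = 3$ both Fisher information expressions come out at $1/a = 1/3$ up to rounding, and the expected score is numerically zero.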







1.6 The Score Function Method * 



Many real-world complex systems in science and engineering can be modeled as discrete-event systems (DES). The behavior of such systems is identified via a sequence of discrete "events," which causes the system to change from one "state" to another. Because of their complexity, the performance evaluation of DES is usually studied by simulation and it is often associated with the estimation of the following response function

$$\ell = \ell(\mathbf{u}) = \mathbb{E}_{\mathbf{u}} H(\mathbf{X}) = \int H(\mathbf{x})\, f(\mathbf{x}; \mathbf{u})\, \mu(d\mathbf{x}) . \tag{1.32}$$

Here $\mathbf{X}$ is assumed to be distributed $f(\cdot; \mathbf{u})$, $\mathbf{u} \in \mathcal{V}$, with respect to some base measure $\mu$, and $H(\mathbf{X})$ is called the sample performance, for example, the steady-state waiting time process in a queueing network. Sensitivity analysis is concerned with evaluating sensitivities (gradients, Hessians, etc.) of the response function $\ell(\mathbf{u})$ with respect to the parameter vector $\mathbf{u}$ and it is based on the score function and the Fisher information. It provides guidance for design and operational decisions and plays an important role in selecting system parameters that optimize certain performance measures.

In this section we give a brief review of the score function (SF) method, 
a powerful approach to sensitivity analysis and optimization of DES via the 
score function. This approach was independently discovered in the late 1960s 
by Aleksandrov, Sysoyev and Shemeneva [7]; Mikhailov [119]; Miller [120] and 
Rubinstein [140]. For relevant references see [16, 17, 67, 106, 136, 141, 143, 
149, 148, 156]. 

The SF approach permits estimation of all sensitivities (gradients, Hes- 
sians, etc.) from a single simulation run (experiment), in the course of eval- 
uating performance (output) measures. Moreover, it can handle sensitivity 
analysis and optimization with hundreds of decision parameters [148]. 

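To proceed let us write (1.32) as

$$\ell(\mathbf{u}) = \mathbb{E}_{\mathbf{v}} H(\mathbf{X})\, W(\mathbf{X}; \mathbf{u}, \mathbf{v}) = \int H(\mathbf{x})\, W(\mathbf{x}; \mathbf{u}, \mathbf{v})\, f(\mathbf{x}; \mathbf{v})\, \mu(d\mathbf{x}) , \tag{1.33}$$

where

$$W(\mathbf{X}; \mathbf{u}, \mathbf{v}) = \frac{f(\mathbf{X}; \mathbf{u})}{f(\mathbf{X}; \mathbf{v})}$$

is the likelihood ratio (LR) of $f(\cdot; \mathbf{u})$ and $f(\cdot; \mathbf{v})$, $\mathbf{X} \sim f(\cdot; \mathbf{v})$, and it is assumed that all densities $f(\cdot; \mathbf{v})$, $\mathbf{v} \in \mathcal{V}$, have the same support. For a fixed $\mathbf{v} \ne \mathbf{u}$ the density $f(\cdot; \mathbf{v})$ is called the importance sampling density, which will play an important role in this book.

An unbiased estimator of $\ell(\mathbf{u})$ is

$$\hat{\ell}(\mathbf{u}; \mathbf{v}) = \frac{1}{N} \sum_{i=1}^{N} H(\mathbf{X}_i)\, W(\mathbf{X}_i; \mathbf{u}, \mathbf{v}) ,$$

which is called the likelihood ratio estimator. Here $\mathbf{X}_1, \ldots, \mathbf{X}_N$ is a random sample from $f(\cdot; \mathbf{v})$. Note that $\hat{\ell}(\mathbf{u}; \mathbf{v})$ is an unbiased estimator of $\ell(\mathbf{u})$ for all $\mathbf{v}$. Therefore, by varying $\mathbf{u}$ and keeping $\mathbf{v}$ fixed we can estimate unbiasedly the whole response surface $\{\ell(\mathbf{u}),\ \mathbf{u} \in \mathcal{V}\}$ via a single simulation. Moreover, we can also estimate from that single simulation (single sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$) the sensitivities of $\ell$, meaning the gradient $\nabla \ell(\mathbf{u})$, the Hessian $\nabla^2 \ell(\mathbf{u})$ and higher order derivatives.

* This section can be omitted at first reading.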
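The idea of estimating a whole response surface from one sample can be illustrated with a toy model. The sketch below assumes $X$ is exponential with mean $u$ and $H(x) = x$, so that $\ell(u) = u$ is known exactly; the sample is drawn once under the reference parameter $v = 2$ and reused for several values of $u$. All names and parameter values are illustrative assumptions.

```python
import math
import random

def exp_pdf(x, mean):
    """Exponential density with the given mean."""
    return math.exp(-x / mean) / mean

def lr_estimate(sample, u, v, H=lambda x: x):
    """Likelihood ratio estimate of ell(u) = E_u H(X) from a sample drawn under f(.; v)."""
    return sum(H(x) * exp_pdf(x, u) / exp_pdf(x, v) for x in sample) / len(sample)

rng = random.Random(0)
v = 2.0
sample = [rng.expovariate(1.0 / v) for _ in range(100_000)]   # one sample with mean v
surface = {u: lr_estimate(sample, u, v) for u in (1.0, 1.5, 2.0, 2.5)}
```

Each entry of `surface` should be close to its key $u$. The estimates become noisier as $u$ moves away from $v$, and for this model the second moment of the likelihood ratio is finite only when $u < 2v$.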

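Let us first consider the gradient of $\ell$. Assume first that the parameter $u$ is a scalar and the parameter set $\mathcal{V}$ is an open interval on the real line. Suppose that for all $\mathbf{x}$, the function $f(\mathbf{x}; u)$ is continuously differentiable in $u$ and that there exists an integrable function $h$, such that

$$\left| H(\mathbf{x})\, \frac{\partial f(\mathbf{x}; u)}{\partial u} \right| \le h(\mathbf{x}) \tag{1.34}$$

for all $u \in \mathcal{V}$. Then by the Lebesgue Dominated Convergence theorem, the differentiation and expectation (integration) operators are interchangeable, so that differentiation of (1.32) yields

$$\frac{d \ell(u)}{du} = \frac{d}{du} \int H(\mathbf{x})\, f(\mathbf{x}; u)\, \mu(d\mathbf{x}) = \int H(\mathbf{x})\, \frac{\partial f(\mathbf{x}; u)}{\partial u}\, \mu(d\mathbf{x})$$

$$= \int H(\mathbf{x})\, \frac{\frac{\partial}{\partial u} f(\mathbf{x}; u)}{f(\mathbf{x}; u)}\, f(\mathbf{x}; u)\, \mu(d\mathbf{x}) = \mathbb{E}_u H(\mathbf{X})\, \frac{\partial \ln f(\mathbf{X}; u)}{\partial u}$$

$$= \mathbb{E}_u H(\mathbf{X})\, \mathcal{S}(u; \mathbf{X}) .$$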



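Consider next the multidimensional case. Similar arguments allow us to represent the gradient and the higher order derivatives of $\ell(\mathbf{u})$ in the form

$$\nabla^k \ell(\mathbf{u}) = \mathbb{E}_{\mathbf{u}} H(\mathbf{X})\, \mathcal{S}^{(k)}(\mathbf{u}; \mathbf{X}) , \tag{1.35}$$

where

$$\mathcal{S}^{(k)}(\mathbf{u}; \mathbf{x}) = \frac{\nabla^k f(\mathbf{x}; \mathbf{u})}{f(\mathbf{x}; \mathbf{u})} \tag{1.36}$$

is the $k$-th order score function, $k = 0, 1, 2, \ldots$. In particular, $\mathcal{S}^{(0)}(\mathbf{u}; \mathbf{x}) = 1$ (by definition), $\mathcal{S}^{(1)}(\mathbf{u}; \mathbf{x}) = \mathcal{S}(\mathbf{u}; \mathbf{x})$ and $\mathcal{S}^{(2)}(\mathbf{u}; \mathbf{x})$ can be represented as

$$\mathcal{S}^{(2)}(\mathbf{u}; \mathbf{x}) = \nabla \mathcal{S}(\mathbf{u}; \mathbf{x}) + \mathcal{S}(\mathbf{u}; \mathbf{x})\, \mathcal{S}(\mathbf{u}; \mathbf{x})^T = \nabla^2 \ln f(\mathbf{x}; \mathbf{u}) + \nabla \ln f(\mathbf{x}; \mathbf{u})\, \nabla \ln f(\mathbf{x}; \mathbf{u})^T . \tag{1.37}$$

All partial derivatives are taken with respect to the components of the parameter vector $\mathbf{u}$.

Applying likelihood ratios to (1.35) yields

$$\nabla^k \ell(\mathbf{u}) = \mathbb{E}_{\mathbf{v}} H(\mathbf{X})\, \mathcal{S}^{(k)}(\mathbf{u}; \mathbf{X})\, W(\mathbf{X}; \mathbf{u}, \mathbf{v}) . \tag{1.38}$$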



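Table 1.4 displays the score functions $\mathcal{S}(\mathbf{u}; x)$ for the commonly used distributions in Table 1.3.

Table 1.4. Score functions for commonly used distributions.

Distr.        f(x; u)                                                                  u        S(u; x)
Gam(a, b)     $\frac{b^a x^{a-1} e^{-bx}}{\Gamma(a)}$                                  (a, b)   $\left( \ln(bx) - \frac{\Gamma'(a)}{\Gamma(a)},\ a b^{-1} - x \right)$
N(a, b^2)     $\frac{1}{b\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-a}{b}\right)^2}$  (a, b)   $\left( b^{-2}(x-a),\ -b^{-1} + b^{-3}(x-a)^2 \right)$
Weib(a, b)    $a b\, (bx)^{a-1} e^{-(bx)^a}$                                           (a, b)   $\left( a^{-1} + \ln(bx)\,[1 - (bx)^a],\ \frac{a}{b}\,[1 - (bx)^a] \right)$
Bin(n, p)     $\binom{n}{x} p^x (1-p)^{n-x}$                                           p        $\frac{x - np}{p(1-p)}$
Po(a)         $\frac{a^x e^{-a}}{x!}$                                                  a        $\frac{x}{a} - 1$
G(p)          $p\,(1-p)^{x-1}$                                                         p        $\frac{1 - px}{p(1-p)}$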


In general, the quantities $\nabla^k \ell(\mathbf{u})$, $k = 0, 1, \ldots$, are not available analytically, since the response $\ell(\mathbf{u})$ is not. They can be evaluated, however, either by conventional deterministic numerical methods [164] or via simulation. Simulation is particularly convenient, as the response, $\ell(\mathbf{u})$, and all the sensitivities, $\nabla^k \ell(\mathbf{u})$, $k = 1, 2, \ldots$, are expressed as expectations with respect to the same pdf, $f(\mathbf{x}; \mathbf{v})$. The LR estimator for $\nabla^k \ell(\mathbf{u})$ is

$$\nabla^k \hat{\ell}(\mathbf{u}; \mathbf{v}) = \frac{1}{N} \sum_{i=1}^{N} H(\mathbf{X}_i)\, \mathcal{S}^{(k)}(\mathbf{u}; \mathbf{X}_i)\, W(\mathbf{X}_i; \mathbf{u}, \mathbf{v}) .$$

It is important to note that the LR estimators $\nabla^k \hat{\ell}(\mathbf{u}; \mathbf{v})$, $k = 0, 1, \ldots$, allow us to estimate the corresponding $\nabla^k \ell(\mathbf{u})$ at virtually any point $\mathbf{u} \in \mathcal{V}$, provided the interchangeability of integration and differentiation is valid. This fact renders the above estimators particularly suitable for solving optimization problems.

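The variance of $\hat{\ell}(\mathbf{u}; \mathbf{v}) = \nabla^0 \hat{\ell}(\mathbf{u}; \mathbf{v})$ is determined by the second moment of $H(\mathbf{X})\, W(\mathbf{X}; \mathbf{u}, \mathbf{v})$, with $\mathbf{X} \sim f(\cdot; \mathbf{v})$. For exponential families of the form (1.3) explicit formulas can be derived. Specifically, with $\boldsymbol{\theta}$ taking the role of $\mathbf{u}$ and $\boldsymbol{\eta}$ the role of $\mathbf{v}$ we have

$$\mathbb{E}_{\boldsymbol{\eta}} \{ H(\mathbf{X})\, W(\mathbf{X}; \boldsymbol{\theta}, \boldsymbol{\eta}) \}^2 = \mathbb{E}_{\boldsymbol{\theta}} H^2(\mathbf{X})\, W(\mathbf{X}; \boldsymbol{\theta}, \boldsymbol{\eta})$$

$$= \int H^2(\mathbf{x})\, \frac{c(\boldsymbol{\theta})}{c(\boldsymbol{\eta})}\, e^{(\boldsymbol{\theta} - \boldsymbol{\eta}) \cdot \mathbf{t}(\mathbf{x})}\, c(\boldsymbol{\theta})\, e^{\boldsymbol{\theta} \cdot \mathbf{t}(\mathbf{x})}\, h(\mathbf{x})\, \mu(d\mathbf{x})$$

$$= \frac{c^2(\boldsymbol{\theta})}{c(\boldsymbol{\eta})} \int H^2(\mathbf{x})\, e^{(2\boldsymbol{\theta} - \boldsymbol{\eta}) \cdot \mathbf{t}(\mathbf{x})}\, h(\mathbf{x})\, \mu(d\mathbf{x})$$

$$= \frac{c^2(\boldsymbol{\theta})}{c(\boldsymbol{\eta})\, c(2\boldsymbol{\theta} - \boldsymbol{\eta})}\, \mathbb{E}_{2\boldsymbol{\theta} - \boldsymbol{\eta}} H^2(\mathbf{X})$$

$$= \mathbb{E}_{\boldsymbol{\eta}} W^2(\mathbf{X}; \boldsymbol{\theta}, \boldsymbol{\eta})\ \mathbb{E}_{2\boldsymbol{\theta} - \boldsymbol{\eta}} H^2(\mathbf{X}) . \tag{1.40}$$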



Confidence regions for $\nabla^k \ell(\mathbf{u})$ can also be obtained by standard techniques; see [149].

Table 1.5 displays the second moments $\mathbb{E}_v W^2(X; u, v)$ for common exponential families in Tables 1.3 and 1.4. Note that in Table 1.5 we change one parameter only, which is denoted by $u$ and is changed to $v$. For the Gamma and Weibull we have used a reparameterization that will be convenient in the following chapters. The values of $\mathbb{E}_v W^2(X; u, v)$ are calculated via (1.3) and (1.40). In particular, we first reparameterize the distribution in terms of (1.3), with $\theta = \psi(u)$ and $\eta = \psi(v)$, and then calculate

$$\mathbb{E}_\eta W^2(X; \theta, \eta) = \frac{c^2(\theta)}{c(\eta)\, c(2\theta - \eta)} .$$

At the end we substitute $u$ and $v$ back in order to obtain the desired $\mathbb{E}_v W^2(X; u, v)$.



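Table 1.5. $\mathbb{E}_v W^2(X; u, v)$ for commonly used distributions.

Distr.             f(x; u)                                                                  θ                       t(x)    c(θ)                                                        E_v W^2(X; u, v)
Gam(a, u^{-1})     $\frac{u^{-a} x^{a-1} e^{-x u^{-1}}}{\Gamma(a)}$                         $-u^{-1}$               $x$     $\frac{(-\theta)^a}{\Gamma(a)}$                             $\left( \frac{v^2}{u(2v - u)} \right)^a$
N(u, b^2)          $\frac{1}{b\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-u}{b}\right)^2}$  $\frac{u}{b^2}$         $x$     $\frac{e^{-\frac{1}{2} b^2 \theta^2}}{b\sqrt{2\pi}}$        $e^{\left(\frac{u-v}{b}\right)^2}$
Weib(a, u^{-1})    $a u^{-1} (x u^{-1})^{a-1} e^{-(x u^{-1})^a}$                            $-u^{-a}$               $x^a$   $-a\theta$                                                  $\frac{(v/u)^{2a}}{2 (v/u)^a - 1}$
Bin(n, u)          $\binom{n}{x} u^x (1-u)^{n-x}$                                           $\ln \frac{u}{1-u}$     $x$     $(1 + e^\theta)^{-n}$                                       $\left( \frac{u^2 - 2uv + v}{(1-v)\, v} \right)^n$
Po(u)              $\frac{u^x e^{-u}}{x!}$                                                  $\ln u$                 $x$     $e^{-e^\theta}$                                             $e^{\frac{(u-v)^2}{v}}$
G(u)               $u\,(1-u)^{x-1}$                                                         $\ln(1-u)$              $x$     $\frac{1 - e^\theta}{e^\theta}$                             $\frac{u^2 (v-1)}{v\,(u^2 - 2u + v)}$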



1.7 Generating Random Variables 

In this section we briefly describe several methods for generating random variables from a prescribed distribution.

1.7.1 The Inverse-Transform Method

Let $X$ be a random variable with cdf $F$. Since $F$ is a nondecreasing function, the inverse function $F^{-1}$ may be defined as

$$F^{-1}(y) = \inf \{ x : F(x) \ge y \} , \quad 0 < y < 1 . \tag{1.42}$$







It is easy to prove that if $U \sim \mathsf{U}(0, 1)$, then

$$X = F^{-1}(U) \tag{1.43}$$

has cdf $F(x)$. Namely, since $F$ is invertible and $\mathbb{P}(U \le u) = u$, we readily obtain that

$$\mathbb{P}(X \le x) = \mathbb{P}(F^{-1}(U) \le x) = \mathbb{P}(U \le F(x)) = F(x) . \tag{1.44}$$

Thus, to generate an outcome, say $x$, of a random variable $X$ with cdf $F$, first sample an outcome, say $u$, from $U \sim \mathsf{U}(0, 1)$, compute $F^{-1}(u)$, and set it equal to $x$. Figure 1.1 illustrates the inverse-transform method summarized in the following algorithm:

Algorithm 1.7.1 (The Inverse-Transform Method)

1. Generate $U$ from $\mathsf{U}(0, 1)$.

2. Return $X = F^{-1}(U)$.
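A minimal sketch of Algorithm 1.7.1 for the exponential distribution follows. It assumes the cdf $F(x) = 1 - e^{-\lambda x}$, whose inverse is $F^{-1}(u) = -\ln(1-u)/\lambda$; the function names and the rate value are illustrative.

```python
import math
import random

def exponential_inverse_cdf(u, rate):
    """F^{-1}(u) for the Exp(rate) cdf F(x) = 1 - exp(-rate * x)."""
    return -math.log(1.0 - u) / rate

def sample_exponential(rate, rng):
    """Algorithm 1.7.1: draw U ~ U(0, 1) and return F^{-1}(U)."""
    return exponential_inverse_cdf(rng.random(), rate)   # rng.random() lies in [0, 1)

rng = random.Random(42)
draws = [sample_exponential(2.0, rng) for _ in range(50_000)]
sample_mean = sum(draws) / len(draws)   # should be near 1 / rate = 0.5
```

Because `rng.random()` never returns 1, the argument of the logarithm stays strictly positive.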



Fig. 1.1. The inverse-transform method. 




In general, the inverse-transform method requires that the underlying cdf, 
F, exists in a form for which the corresponding inverse function F~^ can be 
found analytically or algorithmically. Applicable distributions are, for exam- 
ple, the exponential, uniform, Weibull, logistic, and Cauchy distributions. 

For a discrete $m$-point random variable the inverse-transform method can be written as follows:

Algorithm 1.7.2 (The Inverse-Transform Method for a Discrete Distribution)

1. Generate $U \sim \mathsf{U}(0, 1)$.

2. Find the smallest positive integer $k$, $k = 1, \ldots, m$, such that $U \le F(x_k)$ and return $X = x_k$.
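Step 2 of Algorithm 1.7.2 can be carried out with a binary search over the cumulative probabilities. The sketch below uses the standard-library `bisect` module for this; the helper name `make_discrete_sampler` and the three-point distribution are assumptions of the example.

```python
import bisect
import itertools
import random

def make_discrete_sampler(values, probs):
    """Precompute the cdf F(x_1), ..., F(x_m) and return a function mapping u to x_k."""
    cdf = list(itertools.accumulate(probs))
    def sample(u):
        k = bisect.bisect_left(cdf, u)            # smallest k with u <= cdf[k]
        return values[min(k, len(values) - 1)]    # guard against rounding in the last cdf entry
    return sample

sampler = make_discrete_sampler(["a", "b", "c"], [0.2, 0.5, 0.3])
rng = random.Random(7)
draws = [sampler(rng.random()) for _ in range(20_000)]
freq_b = draws.count("b") / len(draws)   # should be near 0.5
```

The search takes of the order of $\log_2 m$ comparisons per draw instead of up to $m$ for a linear scan.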







Much of the execution time in Algorithm 1.7.2 is spent in making the comparisons of Step 2. This time can be reduced by using efficient search techniques; see [47].

The inverse-transform method can be easily extended to random vectors $\mathbf{X} = (X_1, \ldots, X_n)$ of independent random variables from a given joint cdf $F(\mathbf{x})$. In this case we simply apply the method to each component separately. For dependent random variables the applicability is quite limited since it requires knowledge of the marginal and conditional distributions of the $X_i$; see [148].

1.7.2 Alias Method 

An alternative to the inverse-transform method for generating discrete random variables, which does not require time-consuming search techniques as per Step 2 of Algorithm 1.7.2, is the so-called alias method [170]. It is based on the fact that any arbitrary $n$-point pdf $f$, with

$$f(x_i) = \mathbb{P}(X = x_i) = p_i , \quad i = 1, \ldots, n ,$$

can be represented as an equally weighted mixture of $n - 1$ discrete pdfs, $g^{(k)}$, $k = 1, \ldots, n-1$, each having at most two nonzero components. That is, any $n$-point pdf $f$ can be represented as

$$f(x) = \frac{1}{n-1} \sum_{k=1}^{n-1} g^{(k)}(x) \tag{1.45}$$

for suitably defined 2-point pdfs $g^{(k)}$, $k = 1, \ldots, n-1$; see [170].

The alias method is rather general and efficient, but requires an initial setup and extra storage for the $n - 1$ pdfs, $g^{(k)}$. A procedure for computing these two-point pdfs, $g^{(k)}$, $k = 1, \ldots, n-1$, can be found in [47]. Once the representation (1.45) has been established, generation from $f$ is simple and can be written as:

Algorithm 1.7.3

1. Generate a random variable $U$ from the discrete uniform pdf $\mathsf{DU}\{1, \ldots, n-1\}$. Let $k$ ($k = 1, \ldots, n-1$) be the outcome.

2. Generate a random variable $X$ from the two-point pdf $g^{(k)}$.
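One way to obtain a representation of the form (1.45) is a greedy pairing of the smallest and largest remaining masses. The sketch below is an illustrative construction under that assumption and is not claimed to be the specific procedure of [47]; it works with indices $0, \ldots, n-1$ in place of the points $x_1, \ldots, x_n$, requires $n \ge 2$, and all function names are hypothetical.

```python
import random

def two_point_decomposition(p):
    """Split an n-point pmf p into n - 1 two-point pmfs with f = (1/(n-1)) * sum_k g^(k).

    Each component is returned as (i, q_i, j): mass q_i on index i and 1 - q_i on index j.
    """
    n = len(p)
    q = {i: (n - 1) * pi for i, pi in enumerate(p)}   # rescaled masses; they sum to n - 1
    components = []
    while len(q) > 1:
        i = min(q, key=q.get)            # smallest remaining mass, at most 1
        qi = q.pop(i)
        j = max(q, key=q.get)            # largest remaining mass covers the rest of this component
        components.append((i, min(qi, 1.0), j))
        q[j] -= 1.0 - qi
    return components

def sample_alias(components, rng):
    i, qi, j = components[rng.randrange(len(components))]   # Step 1: uniform component index
    return i if rng.random() < qi else j                    # Step 2: draw from the two-point pmf

p = [0.1, 0.2, 0.3, 0.4]
components = two_point_decomposition(p)
rng = random.Random(3)
draws = [sample_alias(components, rng) for _ in range(40_000)]
```

After the one-time setup, each draw costs one uniform index and one comparison, independent of $n$.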

1.7.3 The Composition Method 

This method assumes that a cdf, $F$, can be expressed as a mixture of cdfs $H_i$, that is:

$$F(x) = \sum_{i=1}^{n} p_i\, H_i(x) , \tag{1.46}$$

where

$$p_i > 0 , \quad \sum_{i=1}^{n} p_i = 1 .$$

Let $X_i \sim H_i$ and let $Y$ be a discrete random variable with $\mathbb{P}(Y = i) = p_i$ and independent of $X_i$, for $1 \le i \le n$. Then the random variable $X$ with cdf $F$ can be represented as

$$X = \sum_{i=1}^{n} X_i\, I_{\{Y = i\}} .$$

It follows that in order to generate $X$ from $F$, we must first generate a discrete random variable $Y$ given above, and then, given $Y = i$, generate $X_i$ from $H_i$. We thus have

Algorithm 1.7.4 (Composition Method)

1. Generate the random variable $Y$ according to

$$\mathbb{P}(Y = i) = p_i , \quad i = 1, \ldots, n .$$

2. Given $Y = i$, return $X$ from the cdf $H_i$.
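A short sketch of Algorithm 1.7.4 for a two-component mixture of exponential distributions is given below. The weights, the rates and the function name `sample_mixture` are illustrative assumptions.

```python
import random

def sample_mixture(weights, samplers, rng):
    """Composition method: pick component i with probability p_i, then draw from H_i."""
    u, cumulative = rng.random(), 0.0
    for p_i, draw in zip(weights, samplers):
        cumulative += p_i
        if u <= cumulative:
            return draw(rng)
    return samplers[-1](rng)   # guard against rounding when the weights sum to just under 1

weights = [0.3, 0.7]
rates = [1.0, 5.0]
samplers = [lambda rng, r=r: rng.expovariate(r) for r in rates]   # H_i = Exp(rate_i)

rng = random.Random(11)
draws = [sample_mixture(weights, samplers, rng) for _ in range(50_000)]
sample_mean = sum(draws) / len(draws)   # mixture mean is 0.3 * 1 + 0.7 * 0.2 = 0.44
```

Step 1 is itself a discrete inverse-transform step, so for many components it could be replaced by a faster discrete sampler.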

1.7.4 The Acceptance-Rejection Method

The inverse-transform and the composition methods are direct methods in the sense that they deal directly with the cdf of the random variable to be generated. Unfortunately, for many probability distributions it is difficult or impossible to find the inverse transform, that is, to solve

$$F(x) = \int_{-\infty}^{x} f(t)\, dt = u$$

with respect to $x$. Even if $F^{-1}$ exists in an explicit form, the inverse-transform method may not necessarily be the most efficient variate generation method [47]. The acceptance-rejection method (ARM), which is an indirect method due to John von Neumann, may be used when the above-mentioned direct methods either fail or turn out to be computationally inefficient.

To carry out the ARM we need to specify (1) a pdf $h$ from which it is easy to generate a random variable, and (2) a constant $C \ge 1$ such that $C\, h(x) \ge f(x)$ for all $x$. We thus represent $f(x)$ as

$$f(x) = C\, h(x)\, g(x) , \tag{1.47}$$

where $0 \le g(x) \le 1$ (see Figure 1.2).

According to the ARM, we generate independently two random variables, $U$ from $\mathsf{U}(0, 1)$ and $Y$ from $h(y)$, and test the inequality $U \le g(Y)$. If the inequality holds, then we accept $Y$ as the required random variable from $f(x)$; otherwise, we reject the pair $(U, Y)$ and try again until a successful pair $(U, Y)$ is obtained. The acceptance-rejection algorithm can be written as follows:







Algorithm 1.7.5 (Acceptance-Rejection Method)

1. Generate $U$ from $\mathsf{U}(0, 1)$.

2. Generate $Y$ from $h(y)$, independent of $U$.

3. If $U \le g(Y)$ return $X = Y$. Otherwise, go to Step 1.

Fig. 1.2. The acceptance-rejection method.




The theoretical justification of the algorithm follows from the application of Bayes' formula to the conditional density $f_Y(x \mid U \le g(Y))$, which can be written as

$$f_Y(x \mid U \le g(Y)) = \frac{\mathbb{P}(U \le g(Y) \mid Y = x)\, h(x)}{\mathbb{P}(U \le g(Y))} . \tag{1.48}$$

Direct computations yield

$$\mathbb{P}(U \le g(Y) \mid Y = x) = \mathbb{P}(U \le g(x)) = g(x) \tag{1.49}$$

and

$$\mathbb{P}(U \le g(Y)) = \int \mathbb{P}(U \le g(Y) \mid Y = x)\, h(x)\, dx = \int g(x)\, h(x)\, dx = \int \frac{f(x)}{C}\, dx = \frac{1}{C} . \tag{1.50}$$

Upon substitution of (1.49) and (1.50) into (1.48), we obtain

$$f_Y(x \mid U \le g(Y)) = C\, h(x)\, g(x) = f(x) .$$

The efficiency of the acceptance-rejection method is determined by the acceptance probability $p = \mathbb{P}(U \le g(Y)) = 1/C$ (see (1.50)). Note that the number of trials, say $N$, up to and including the first successful pair $(U, Y)$ has a geometric distribution,

$$\mathbb{P}(N = n) = p\,(1-p)^{n-1} , \quad n = 1, 2, \ldots , \tag{1.51}$$

with the expected number of trials equal to $1/p = C$.

For this method to be of practical interest, the following criteria must be used in selecting $h(x)$:







1. It should be easy to generate a random variable from $h(x)$.

2. The efficiency, $1/C$, of the procedure should not be too small, that is, $h(x)$ should be "close" to $f(x)$.

Algorithm 1.7.5 is directly applicable to the multidimensional case, using the same reasoning. We need only bear in mind that $\mathbf{Y}$ (see Step 2 of Algorithm 1.7.5) becomes an $n$-dimensional random vector, rather than a one-dimensional random variable. Consequently, we need a convenient way of generating $\mathbf{Y}$ from the multidimensional pdf $h(\mathbf{y})$. However, the efficiency of the ARM decreases dramatically with $n$ if $h(\mathbf{x}) \ne f(\mathbf{x})$. In practice it can be used only for $n < 10$; see [148].
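The following sketch applies Algorithm 1.7.5 to an illustrative target, the density $f(x) = 6x(1-x)$ on $[0, 1]$, with the uniform density on $[0,1]$ as proposal $h$. Since the maximum of $f$ is $1.5$, one may take $C = 1.5$, which gives $g(x) = f(x)/(C\,h(x)) = 4x(1-x)$. The target, the constant and the function names are assumptions of this example.

```python
import random

C = 1.5   # maximum of f(x) = 6 x (1 - x) on [0, 1], so C * h(x) >= f(x) for h = U(0, 1)

def g(x):
    """Acceptance function f(x) / (C h(x)); it takes values in [0, 1]."""
    return 4.0 * x * (1.0 - x)

def sample_by_acceptance_rejection(rng):
    """Return (X, number of trials) following Algorithm 1.7.5 with a uniform proposal."""
    trials = 0
    while True:
        trials += 1
        u = rng.random()        # Step 1: U ~ U(0, 1)
        y = rng.random()        # Step 2: Y ~ h, independent of U
        if u <= g(y):           # Step 3: accept Y, otherwise try again
            return y, trials

rng = random.Random(5)
results = [sample_by_acceptance_rejection(rng) for _ in range(20_000)]
mean_x = sum(x for x, _ in results) / len(results)        # target mean is 0.5
mean_trials = sum(t for _, t in results) / len(results)   # expected number of trials is C = 1.5
```

The observed average number of trials per accepted value should be close to $C$, in line with the geometric distribution (1.51).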

1.8 Exercises 

Probability Distributions 

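1. Let $Y = e^X$, where $X \sim \mathsf{N}(0, 1)$. That is, $Y$ has a log-normal distribution. Show that

$$\mathbb{E}\, e^{sY} = \infty , \quad \text{for all } s > 0 .$$

In other words, $Y$ has a heavy-tail distribution.

2. Let $X \sim \mathsf{Beta}(a, b)$. Show that

$$\mathbb{E}X = \frac{a}{a+b} \quad \text{and} \quad \mathrm{Var}(X) = \frac{ab}{(a+b)^2 (a+b+1)} .$$

3. Let $X_n \sim \mathsf{Beta}(a_n, b_n)$. Suppose $a_n, b_n \to \infty$ and $a_n/b_n \to c$, as $n \to \infty$. Show that $X_n$ converges in probability to $c/(c+1)$, that is,

$$\lim_{n \to \infty} \mathbb{P}\left( \left| X_n - \frac{c}{c+1} \right| > \varepsilon \right) = 0 ,$$

for all $\varepsilon > 0$.

4. Let $U \sim \mathsf{U}[0, 1]$. Show that $X = U^{1/v} \sim \mathsf{Beta}(v, 1)$ and $1 - X \sim \mathsf{Beta}(1, v)$.

5. Show that the cumulant function of the $\mathsf{Beta}(2, 1)$ distribution is given by

$$\zeta(s) = \ln \left( \frac{2 e^s (s-1) + 2}{s^2} \right) , \quad s \in \mathbb{R} .$$

6. Consider the collection of pdfs $\{f(\cdot; v),\ v > 0\}$, with $f(x; v) = v\, x^{v-1}$, $0 < x \le 1$. Show that this collection (of $\mathsf{Beta}(v, 1)$ pdfs) forms an exponential family of the form (1.3), and specify $\theta$, $c(\theta)$, $t(x)$ and $h(x)$ in terms of the parameter $v$.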







Information 

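7. Random variables $X$ and $Y$ have joint density $f$ given by

$$f(x, y) = \begin{cases} 1/3, & (x, y) = (0, 0) \\ 1/3, & (x, y) = (0, 1) \\ 0, & (x, y) = (1, 0) \\ 1/3, & (x, y) = (1, 1) . \end{cases}$$

Find the following quantities:

a) $\mathcal{H}(X)$, $\mathcal{H}(Y)$.

b) $\mathcal{H}(X \mid Y)$, $\mathcal{H}(Y \mid X)$.

c) $\mathcal{H}(X, Y)$.

d) $\mathcal{H}(X) - \mathcal{H}(X \mid Y)$.

e) $\mathcal{M}(X, Y)$.

8. Let $X$, $Y$ and $Z$ be random variables. Prove the following statements:

a) $\mathcal{H}(X \mid Y) \le \mathcal{H}(X)$.

b) $\mathcal{H}(X, Y \mid Z) = \mathcal{H}(X \mid Z) + \mathcal{H}(Y \mid X, Z)$.

c) The mutual information of $X$ and $Y$ satisfies

$$\mathcal{M}(X, Y) = \mathcal{H}(X) - \mathcal{H}(X \mid Y) = \mathcal{H}(Y) - \mathcal{H}(Y \mid X) .$$

9. Prove the following statements:

a) $\mathcal{D}(g, h) \ge 0$.

b) $\mathcal{D}(g, h)$ is convex in the pair $(g, h)$.

c) If $|\mathcal{X}| = m$ and $u$ is the (discrete) uniform pdf over $\mathcal{X}$, then $\mathcal{D}(g, u) = \ln m - \mathcal{H}(g)$.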
10. Let $X_1, X_2, \ldots, X_N$ be a random sample from the pdf

$$f(x; \theta) = \theta^{-1} x^{-(1 + \theta^{-1})} , \quad x > 1 ,$$

for $0 < \theta < 1/2$. The method of moments estimator (MME) of $\theta$ is obtained by solving, with respect to $\theta$, the equation

$$h(\theta) = N^{-1} \sum_{i=1}^{N} X_i = \bar{X} ,$$

where $h(\theta) = \mathbb{E}_\theta X_1 = 1/(1 - \theta)$. It follows that the MME is given by

$$\tilde{\theta} = 1 - \frac{1}{\bar{X}} .$$

Show that the maximum likelihood estimator (MLE) is given by

$$\hat{\theta} = \frac{1}{N} \sum_{i=1}^{N} \ln X_i .$$

Verify the following statements:

a) $\mathbb{E}_\theta \ln X_1 = \theta$,

b) $\mathrm{Var}_\theta(\ln X_1) = \theta^2$,

c) $\mathrm{Var}_\theta(X_1) = \dfrac{\theta^2}{(1 - 2\theta)(1 - \theta)^2}$,

d) The Fisher information corresponding to $X_1$ is given by $\mathcal{I}(\theta) = 1/\theta^2$.

Check that, as a consequence of (a)-(c), both $\tilde{\theta}$ and $\hat{\theta}$ are asymptotically unbiased estimators of $\theta$, that is, $\mathbb{E}_\theta \tilde{\theta} \to \theta$ as $N \to \infty$, and similarly for $\hat{\theta}$. In fact, $\hat{\theta}$ is unbiased for every $N$. It is easy to see that by the central limit theorem $\sqrt{N}(\hat{\theta} - \theta)$ converges in distribution to a $\mathsf{N}(0, \theta^2)$ distribution. Similarly, it can be shown (see Theorem 5.4.1 of [153]) that $\sqrt{N}(\tilde{\theta} - \theta)$ converges in distribution to a $\mathsf{N}(0, \mathrm{Var}_\theta(X_1)/(h'(\theta))^2)$-distribution. Show that for large $N$ the MLE is more efficient than the MME. Finally, note that property (d) shows that the variance of $\hat{\theta}$ attains the theoretically smallest possible value (Cramér-Rao bound (1.31)) of $1/(\mathcal{I}(\theta) N)$; in other words, the MLE is (asymptotically) efficient.



The Score Function Method 

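11. Calculate the score function $\mathcal{S}$ for the Beta(a, b) pdf

$$f(x) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, x^{a-1} (1-x)^{b-1}, \quad x \in (0,1),$$

and the Pareto(a, b) pdf

$$f(x) = a\,b\,(1 + b x)^{-(a+1)}, \quad x \in \mathbb{R}_+, \ a, b > 0.$$

12. Calculate $\mathbb{E}_v W^2(X; u, v)$ for the Beta(v, 1) and Pareto(v, 1) distributions.

13. Show that

$$\nabla_{\mathbf{u}} W(\mathbf{X}; \mathbf{u}, \mathbf{v}) = W(\mathbf{X}; \mathbf{u}, \mathbf{v})\, \mathcal{S}(\mathbf{u}; \mathbf{X}).$$

14* Show that in analogy to (1.40) the covariance matrix of $H(\mathbf{X}) \nabla_{\boldsymbol\theta} W(\mathbf{X}; \boldsymbol\theta, \boldsymbol\eta)$
with $\mathbf{X} \sim f(\cdot; \boldsymbol\eta)$, is given by

$$\mathbb{E}_{\boldsymbol\eta}\left[ W^2(\mathbf{X}; \boldsymbol\theta, \boldsymbol\eta)\, H^2(\mathbf{X})\, \mathcal{S}(\boldsymbol\theta; \mathbf{X})\, \mathcal{S}^T(\boldsymbol\theta; \mathbf{X}) \right] - \nabla \ell(\boldsymbol\theta)\, [\nabla \ell(\boldsymbol\theta)]^T.$$

15* Consider the exponential pdf $f(x;\theta) = \theta \exp(-\theta x)$. Show that if $S(x)$ is
a monotonically increasing function, then the expected performance
$\ell = \mathbb{E}_\theta S(X)$, which is assumed to be finite, is a monotonically decreasing,
convex function of $\theta \in (0, \infty)$.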







Generating Random Variables 

16. Apply the inverse-transform method to generate random variables from a 
Laplace distribution (shifted two-sided exponential distribution), with 

^ . a; G K , /? > 0 . 

17. Apply the inverse-transform algorithm to generate random variables from
the piecewise-constant pdf

$$f(x) = \begin{cases} C_i, & x_{i-1} \leq x \leq x_i, \quad i = 1, 2, \ldots, n, \\ 0, & \text{otherwise}, \end{cases}$$

where $x_0 < x_1 < \cdots < x_{n-1} < x_n$ and $C_i \geq 0$, $i = 1, \ldots, n$.

18. Let

$$f(x) = \begin{cases} C_i\, x, & x_{i-1} \leq x < x_i, \quad i = 1, \ldots, n, \\ 0, & \text{otherwise}, \end{cases}$$

where $0 \leq x_0 < x_1 < \cdots < x_{n-1} < x_n$ and $C_i \geq 0$, $i = 1, \ldots, n$. Using the
inverse-transform method, show that if $X \sim f$, then $X$ can be written as

$$X = \left( x_{r-1}^2 + \frac{2\,(U - F_{r-1})}{C_r} \right)^{1/2},$$

where $U \sim \mathsf{U}(0,1)$, $F_i = \sum_{k=1}^{i} \int_{x_{k-1}}^{x_k} C_k\, y \, dy$, $i = 1, 2, \ldots, n$, and
$r = \inf\{i : F_i \geq U\}$. Describe an algorithm for generating random variables
from $f(x)$.



19. Apply the inverse-transform method to generate random variables from
the following (discrete) pdf:

$$f(x) = \begin{cases} \dfrac{1}{n+1}, & x = 0, 1, \ldots, n, \\ 0, & \text{otherwise}. \end{cases}$$



20. Using the acceptance-rejection algorithm generate a random variable from
the pdf

$$f(x) = \begin{cases} C_i\, x, & x_{i-1} \leq x < x_i, \quad i = 1, \ldots, n, \\ 0, & \text{otherwise}, \end{cases}$$







where $0 = x_0 < x_1 < \cdots < x_{n-1} < x_n = 1$ and $C_i \geq 0$, $i = 1, \ldots, n$.
Represent $f(x)$ as

$$f(x) = C\, h(x)\, g(x),$$

where

$$h(x) = 2x, \quad 0 < x < 1.$$

Calculate the efficiency of the algorithm for the case 



Cl = C2 



^ . . . 5 ? 7 < . 

n 



21. Apply the acceptance-rejection algorithm for generating a random variable
from the pdf



fix) = x^O , 

using the representation $f(x) = C\, h(x)\, g(x)$, where

h{x) = ^ X > 0 , ^ > 0 , 



for fixed p. 

22. Generate a random variable from the pdf

$$f(x) = k\, e^{-\lambda x}, \quad 0 \leq x \leq a,$$

using

a) the inverse-transform method,

b) the acceptance-rejection method with

$$f(x) = C\, h(x)\, g(x), \quad \text{where } h(x) = \lambda e^{-\lambda x}, \ x > 0.$$

Find the efficiency of the acceptance-rejection method for the cases $a = 1$,
$a \to 0$ and $a \to \infty$.

23. Let $X_1 \sim \mathsf{Gam}(a, 1)$ and $X_2 \sim \mathsf{Gam}(b, 1)$ be independent. Prove that

$$\frac{X_1}{X_1 + X_2} \sim \mathsf{Beta}(a, b).$$

This implies a general procedure for generating random variables from the
Beta(a, b) distribution. There are many efficient algorithms for generating
random variables from a Gam(a, 1) distribution; see for example [142].




2 A Tutorial Introduction to the Cross-Entropy Method



2.1 Introduction 

The aim of this chapter is to provide a gentle and self-contained introduction 
to the cross-entropy (CE) method. We refer to Section 1.1 for additional 
background information on the CE method, including many references. 

We wish to show that 

1. the CE method presents a simple, efficient, and general method for solving
a great variety of estimation and optimization problems, especially NP-hard
combinatorial deterministic and stochastic (noisy) problems,

2. the CE method is a valuable tool for Monte-Carlo simulation, in particular
when very small probabilities need to be accurately estimated (so-called
rare-event simulation).

The CE method has its origins in an adaptive algorithm for rare-event simula- 
tion, based on variance minimization [144]. This procedure was soon modified 
[145] to a randomized optimization technique, where the original variance min- 
imization program was replaced by an associated cross-entropy minimization 
problem; see Section 1.1. 

In the field of rare-event simulation, the CE method is used in conjunction 
with importance sampling (IS), a well-known variance reduction technique in 
which the system is simulated under a different set of parameters, called the 
reference parameters — or, more generally, a different probability distribution 
— so as to make the occurrence of the rare event more likely. A major draw- 
back of the conventional IS technique is that the optimal reference parameters 
to be used in IS are usually very difficult to obtain. Traditional techniques 
for estimating the optimal reference parameters [148] typically involve time- 
consuming variance minimization (VM) programs. The advantage of the CE 
method is that it provides a simple and fast adaptive procedure for estimating 
the optimal reference parameters in the IS. Moreover, the CE method also en- 
joys asymptotic convergence properties. For example, it is shown in [85] that 







for static models (cf. Remark 2.3), under mild regularity conditions, the
CE method terminates with probability 1 in a finite number of iterations,
and delivers a consistent and asymptotically normal estimator for the optimal
reference parameters. Recently the CE method has been successfully applied
to the estimation of rare-event probabilities in dynamic models, in particular
queueing models involving both light- and heavy-tail input distributions; see
[46, 15] and Chapter 3.

In the field of optimization problems — combinatorial or otherwise — the 
CE method can be readily applied by first translating the underlying opti- 
mization problem into an associated estimation problem, the so-called associ- 
ated stochastic problem (ASP), which typically involves rare-event estimation. 
Estimating the rare-event probability and the associated optimal reference 
parameter for the ASP via the CE method translates effectively back into 
solving the original optimization problem. Many combinatorial optimization 
problems (COPs) can be formulated as optimization problems concerning a 
weighted graph. Depending on the particular problem, the ASP introduces 
randomness in either 

(a) the nodes of the graph, in which case we speak of a stochastic node network 
(SNN), or 

(b) the edges of the graph, in which case we speak of a stochastic edge network 
(SEN). 

Examples of SNN problems are the maximal cut (max-cut) problem, the buffer 
allocation problem and clustering problems. Examples of SEN problems are 
the travelling salesman problem (TSP), the quadratic assignment problem, 
the clique problem, and optimal policy search in Markovian decision problems 
(MDPs). We should emphasize that the CE method may be applied to both 
deterministic and stochastic COPs. In the latter the objective function itself 
is random or needs to be estimated via simulation. Stochastic COPs typically 
occur in stochastic scheduling, flow control, and routing of data networks [24]
and in various simulation-based optimization models [148], such as optimal 
buffer allocation [9]. Chapter 6 deals with noisy optimization problems, for 
which the CE method is ideally suited. 

Recently it was found that the CE method has a strong connection with 
the fields of neural computation and reinforcement learning. Here CE has been 
successfully applied to clustering and vector quantization and several MDPs 
under uncertainty. Indeed, the CE algorithm can be viewed as a stochastic 
learning algorithm involving the following two iterative phases: 

1. Generation of a sample of random data (trajectories, vectors, etc.) accord- 
ing to a specified random mechanism. 

2. Updating the parameters of the random mechanism, typically parameters 
of pdfs, on the basis of the data, to produce a “better” sample in the next 
iteration. 







The significance of the cross-entropy concept is that it defines a precise math- 
ematical framework for deriving fast and “good” updating/learning rules. 

The rest of the chapter is organized as follows. In Section 2.2 we present two 
toy examples that illustrate the basic methodology behind the CE method. 
The general theory and algorithms are detailed in Section 2.3, for rare-event 
simulation, and Section 2.4, for Combinatorial Optimization. Finally, in Sec- 
tion 2.5 we discuss the application of the CE method to the max-cut and the 
TSP, and provide numerical examples of the performance of the algorithm. 

Our intention is not to compare the CE method with other heuristics, but to
demonstrate its beauty and simplicity and to promote CE for further applications
to optimization and rare-event simulation. This chapter is based partly on [44].



2.2 Methodology: Two Examples 

In this section we illustrate the methodology of the CE method via two toy 
examples, one dealing with rare-event simulation, and the other with combi- 
natorial optimization. 



2.2.1 A Rare-Event Simulation Example 



Consider the weighted graph of Figure 2.1, with random weights $X_1, \ldots, X_5$.
Suppose the weights are independent and exponentially distributed random
variables with means $u_1, \ldots, u_5$, respectively. Denote the probability density
function (pdf) of $\mathbf{X} = (X_1, \ldots, X_5)$ by $f(\cdot; \mathbf{u})$; thus,

$$f(\mathbf{x}; \mathbf{u}) = \exp\left( -\sum_{j=1}^{5} \frac{x_j}{u_j} \right) \prod_{j=1}^{5} \frac{1}{u_j}. \qquad (2.1)$$

Let $S(\mathbf{X})$ be the total length of the shortest path from node A to node B.




Fig. 2.1. Shortest path from A to B. 







We wish to estimate from simulation

$$\ell = \mathbb{P}(S(\mathbf{X}) \geq \gamma) = \mathbb{E}\, I_{\{S(\mathbf{X}) \geq \gamma\}}, \qquad (2.2)$$

that is, the probability that the length of the shortest path $S(\mathbf{X})$ will exceed
some fixed $\gamma$. A straightforward way to estimate $\ell$ in (2.2) is to use crude Monte
Carlo (CMC) simulation. That is, we draw a random sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ from
the distribution of $\mathbf{X}$ and use

$$\frac{1}{N} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \gamma\}} \qquad (2.3)$$

as the unbiased estimator of $\ell$. However, for large $\gamma$ the probability $\ell$ will be
very small and CMC requires a very large simulation effort. Namely, $N$ needs
to be very large in order to estimate $\ell$ accurately, that is, to obtain a small
relative error of 0.01, say. A better way to perform the simulation is to use
importance sampling (IS). That is, let $g$ be another probability density such
that $g(\mathbf{x}) = 0 \Rightarrow I_{\{S(\mathbf{x}) \geq \gamma\}} f(\mathbf{x}) = 0$. Using the density $g$ we can represent $\ell$
as

$$\ell = \int I_{\{S(\mathbf{x}) \geq \gamma\}} \frac{f(\mathbf{x})}{g(\mathbf{x})}\, g(\mathbf{x}) \, d\mathbf{x} = \mathbb{E}_g\, I_{\{S(\mathbf{X}) \geq \gamma\}} \frac{f(\mathbf{X})}{g(\mathbf{X})}, \qquad (2.4)$$

where the subscript $g$ means that the expectation is taken with respect to $g$,
which is called the importance sampling (IS) density. An unbiased estimator
of $\ell$ is

$$\hat{\ell} = \frac{1}{N} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \gamma\}}\, W(\mathbf{X}_i), \qquad (2.5)$$

where $\hat{\ell}$ is called the importance sampling (IS) or the likelihood ratio (LR)
estimator,

$$W(\mathbf{x}) = f(\mathbf{x}) / g(\mathbf{x}) \qquad (2.6)$$

is called the likelihood ratio (LR), and $\mathbf{X}_1, \ldots, \mathbf{X}_N$ is a random sample from $g$,
that is, $\mathbf{X}_1, \ldots, \mathbf{X}_N$ are i.i.d. random vectors with density $g$. In the particular
case where there is no “change of measure,” that is, $g = f$, we have $W = 1$,
and the LR estimator in (2.5) reduces to the CMC estimator (2.3).
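The estimators (2.3) and (2.5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph of Figure 2.1 is not reproduced in the text, so the code assumes a five-edge bridge network in which the four paths from A to B are $x_1 + x_4$, $x_1 + x_3 + x_5$, $x_2 + x_3 + x_4$ and $x_2 + x_5$. The IS density $g$ is assumed to be a product of exponential densities with hypothetical means `v`, and the names `shortest_path` and `is_estimate` are ours.

```python
import numpy as np

def shortest_path(x):
    # x has shape (N, 5). ASSUMPTION: bridge network with these four A-to-B paths.
    x1, x2, x3, x4, x5 = x.T
    return np.minimum.reduce([x1 + x4, x1 + x3 + x5, x2 + x3 + x4, x2 + x5])

def is_estimate(u, v, gamma, n, rng):
    """LR estimator (2.5) with g a product of exponentials with means v.

    Returns (estimate, estimated relative error). With v equal to u the
    likelihood ratio is identically 1, so this reduces to CMC, (2.3).
    """
    x = rng.exponential(scale=v, size=(n, u.size))          # sample from g
    # W(x) = f(x; u) / g(x) for independent exponential components
    w = np.exp(-(x * (1.0 / u - 1.0 / v)).sum(axis=1)) * np.prod(v / u)
    z = (shortest_path(x) >= gamma) * w
    est = z.mean()
    rel_err = z.std(ddof=1) / (np.sqrt(n) * est) if est > 0 else float("inf")
    return est, rel_err

rng = np.random.default_rng(1)
u = np.array([0.25, 0.4, 0.1, 0.3, 0.2])      # nominal means
v = np.array([1.7, 1.9, 0.13, 0.7, 0.56])     # hypothetical tilted means
cmc = is_estimate(u, u, 2.0, 100_000, rng)    # g = f, i.e. crude Monte Carlo
tilted = is_estimate(u, v, 2.0, 100_000, rng)
print("CMC:", cmc, " IS:", tilted)
```

With the same sample size, the CMC run typically sees only a handful of hits (possibly none), whereas the tilted run observes the event often and reweights each hit by $W$. How to choose `v` is the question the rest of the section addresses.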

Let us restrict ourselves to $g$ such that $X_1, \ldots, X_5$ are independent and
exponentially distributed with means $v_1, \ldots, v_5$. Then

$$W(\mathbf{x}; \mathbf{u}, \mathbf{v}) = \frac{f(\mathbf{x}; \mathbf{u})}{f(\mathbf{x}; \mathbf{v})} = \exp\left( -\sum_{j=1}^{5} x_j \left( \frac{1}{u_j} - \frac{1}{v_j} \right) \right) \prod_{j=1}^{5} \frac{v_j}{u_j}. \qquad (2.7)$$

In this case the “change of measure” is determined by the parameter vector
$\mathbf{v} = (v_1, \ldots, v_5)$. The main problem now is how to select a $\mathbf{v}$ which gives
the most accurate estimate of $\ell$ for a given simulation effort. As we shall see
soon, one of the strengths of the CE method for rare-event simulation is that
it provides a fast way to determine/estimate the optimal parameters. To this
end, without going into the details, a quite general CE algorithm for rare-event
estimation is outlined next.







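Algorithm

1. Define $\hat{\mathbf{v}}_0 = \mathbf{u}$. Set $t = 1$ (iteration counter).

2. Generate a random sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ according to the pdf $f(\cdot; \hat{\mathbf{v}}_{t-1})$.
Calculate the performances $S(\mathbf{X}_i)$ for all $i$, and order them from smallest
to biggest, $S_{(1)} \leq \ldots \leq S_{(N)}$. Let $\hat{\gamma}_t$ be the sample $(1-\rho)$-quantile of
performances: $\hat{\gamma}_t = S_{(\lceil (1-\rho) N \rceil)}$, provided this is less than $\gamma$. Otherwise,
put $\hat{\gamma}_t = \gamma$.

3. Use the same sample to calculate, for $j = 1, \ldots, n\ (= 5)$,

$$\hat{v}_{t,j} = \frac{\sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \hat{\gamma}_t\}}\, W(\mathbf{X}_i; \mathbf{u}, \hat{\mathbf{v}}_{t-1})\, X_{ij}}{\sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \hat{\gamma}_t\}}\, W(\mathbf{X}_i; \mathbf{u}, \hat{\mathbf{v}}_{t-1})}. \qquad (2.8)$$

4. If $\hat{\gamma}_t = \gamma$ then proceed to Step 5; otherwise set $t = t + 1$ and reiterate
from Step 2.

5. Let $T$ be the final iteration. Generate a sample $\mathbf{X}_1, \ldots, \mathbf{X}_{N_1}$ according to
the pdf $f(\cdot; \hat{\mathbf{v}}_T)$ and estimate $\ell$ via the IS estimator

$$\hat{\ell} = \frac{1}{N_1} \sum_{i=1}^{N_1} I_{\{S(\mathbf{X}_i) \geq \gamma\}}\, W(\mathbf{X}_i; \mathbf{u}, \hat{\mathbf{v}}_T). \qquad (2.9)$$

Note that in Steps 2-4 the optimal IS parameter is estimated. In the final
step (Step 5) this parameter is used to estimate the probability of interest.
Note also that the algorithm assumes availability of the parameters $\rho$ (typically
between 0.01 and 0.1), $N$ and $N_1$ in advance.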
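The five steps above can be sketched as follows. This is a minimal sketch, not the authors' Matlab implementation. It assumes the bridge topology for Figure 2.1 (four A-to-B paths $x_1 + x_4$, $x_1 + x_3 + x_5$, $x_2 + x_3 + x_4$, $x_2 + x_5$), uses the weighted-mean update of Step 3, and adds a hypothetical `max_iter` safeguard that is not part of the algorithm as stated. All function and parameter names are ours.

```python
import numpy as np

def shortest_path(x):
    # ASSUMPTION: bridge network; x has shape (N, 5).
    x1, x2, x3, x4, x5 = x.T
    return np.minimum.reduce([x1 + x4, x1 + x3 + x5, x2 + x3 + x4, x2 + x5])

def likelihood_ratio(x, u, v):
    # W(x; u, v) for independent exponentials with means u and v, cf. (2.7)
    return np.exp(-(x * (1.0 / u - 1.0 / v)).sum(axis=1)) * np.prod(v / u)

def ce_rare_event(u, gamma, rho=0.1, n=1000, n1=100_000, seed=0, max_iter=50):
    rng = np.random.default_rng(seed)
    v = u.copy()                                   # Step 1: v_0 = u
    for t in range(1, max_iter + 1):
        x = rng.exponential(scale=v, size=(n, u.size))     # Step 2: sample
        s = shortest_path(x)
        k = int(np.ceil((1.0 - rho) * n))          # index of the order statistic
        gamma_t = min(np.sort(s)[k - 1], gamma)    # quantile, capped at gamma
        w = likelihood_ratio(x, u, v) * (s >= gamma_t)
        v = (w[:, None] * x).sum(axis=0) / w.sum() # Step 3: weighted means
        if gamma_t >= gamma:                       # Step 4: level reached
            break
    x = rng.exponential(scale=v, size=(n1, u.size))        # Step 5: final IS run
    est = np.mean((shortest_path(x) >= gamma) * likelihood_ratio(x, u, v))
    return est, v, t

u = np.array([0.25, 0.4, 0.1, 0.3, 0.2])
est, v_final, iters = ce_rare_event(u, gamma=2.0)
print(iters, v_final.round(3), est)
```

The printed values depend on the random seed and on the assumed topology, so they need not coincide with any tabulated run. What the sketch is meant to show is the structure: the level $\hat{\gamma}_t$ climbs towards $\gamma$ while the means $\hat{\mathbf{v}}_t$ are tilted so that long shortest paths become common.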

As an example, consider the case where the nominal parameter vector $\mathbf{u}$ is
given by (0.25, 0.4, 0.1, 0.3, 0.2). Suppose we wish to estimate the probability
that the minimum path is greater than $\gamma = 2$. Crude Monte Carlo with $10^7$
samples gave an estimate $1.65 \cdot 10^{-5}$ with an estimated relative error, RE (that
is, $\sqrt{\mathrm{Var}(\hat{\ell})}/\ell$), of 0.165. With $10^8$ samples we got the estimate $1.30 \cdot 10^{-5}$ with
RE 0.03.

Table 2.1 displays the results of the CE method, using $N = 1000$ and
$\rho = 0.1$. This table was computed in less than half a second.



Table 2.1. Evolution of the sequence $\{(\hat{\gamma}_t, \hat{\mathbf{v}}_t)\}$.

  t    $\hat{\gamma}_t$    $\hat{\mathbf{v}}_t$
  0      -      0.250   0.400   0.100   0.300   0.200
  1    0.575    0.513   0.718   0.122   0.474   0.335
  2    1.032    0.873   1.057   0.120   0.550   0.436
  3    1.502    1.221   1.419   0.121   0.707   0.533
  4    1.917    1.681   1.803   0.132   0.638   0.523
  5    2.000    1.692   1.901   0.129   0.712   0.564







Using the estimated optimal parameter vector of $\hat{\mathbf{v}}_5$ = (1.692, 1.901, 0.129,
0.712, 0.564), the final step with $N_1 = 10^5$ now gave an estimate of $1.34 \cdot 10^{-5}$
with an estimated RE of 0.03. The simulation time was only 3 seconds, using a
Matlab implementation on a Pentium III 500 MHz processor. In contrast, the
CPU time required for the CMC method with $10^7$ samples is approximately
630 seconds, and with $10^8$ samples approximately 6350. We see that with a
minimal amount of work we have reduced our simulation effort (CPU time)
by roughly a factor of 625.

2.2.2 A Combinatorial Optimization Example 

Consider a binary vector $\mathbf{y} = (y_1, \ldots, y_n)$. Suppose that we do not know
which components of $\mathbf{y}$ are 0 and which are 1. However, we have an “oracle”
which for each binary input vector $\mathbf{x} = (x_1, \ldots, x_n)$ returns the performance
or response,

$$S(\mathbf{x}) = n - \sum_{j=1}^{n} |x_j - y_j|,$$

representing the number of matches between the elements of $\mathbf{x}$ and $\mathbf{y}$. Our
goal is to present a random search algorithm which reconstructs* (decodes)
the unknown vector $\mathbf{y}$ by maximizing the function $S(\mathbf{x})$ on the space of
$n$-dimensional binary vectors.



Fig. 2.2. A “device” for reconstructing vector $\mathbf{y}$ (input $\mathbf{x}$, output $S(\mathbf{x})$).



A naive way is to repeatedly generate binary vectors $\mathbf{X} = (X_1, \ldots, X_n)$
such that $X_1, \ldots, X_n$ are independent Bernoulli random variables with success
probabilities $p_1, \ldots, p_n$. We write $\mathbf{X} \sim \mathsf{Ber}(\mathbf{p})$, where $\mathbf{p} = (p_1, \ldots, p_n)$.
Note that if $\mathbf{p} = \mathbf{y}$, which corresponds to the degenerate case of the Bernoulli
distribution, we have $S(\mathbf{X}) = n$, $\mathbf{X} = \mathbf{y}$, and the naive search algorithm yields
the optimal solution with probability 1. The CE method for combinatorial
optimization consists of creating a sequence of parameter vectors $\hat{\mathbf{p}}_0, \hat{\mathbf{p}}_1, \ldots$ and
* Of course, in this toy example the vector y can be easily reconstructed from the 
input vectors (0, 0, ... , 0), (1,0,..., 0), (0, 1, 0, . . . , 0), . . . , (0, . . . , 0, 1) only. 







levels $\hat{\gamma}_1, \hat{\gamma}_2, \ldots$, such that $\hat{\gamma}_1, \hat{\gamma}_2, \ldots$ converges to the optimal performance
($n$ here) and $\hat{\mathbf{p}}_0, \hat{\mathbf{p}}_1, \ldots$ converges to the optimal degenerated parameter vector
that coincides with $\mathbf{y}$. Again, the CE procedure, which is similar to
the rare-event procedure described in the CE algorithm in Section 2.2.1, is
outlined below, without detail.

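Algorithm

1. Start with some $\hat{\mathbf{p}}_0$. Let $t = 1$.

2. Draw a sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ of Bernoulli vectors with success probability
vector $\hat{\mathbf{p}}_{t-1}$. Calculate the performances $S(\mathbf{X}_i)$ for all $i$, and order them
from smallest to biggest, $S_{(1)} \leq \ldots \leq S_{(N)}$. Let $\hat{\gamma}_t$ be the sample
$(1-\rho)$-quantile of the performances: $\hat{\gamma}_t = S_{(\lceil (1-\rho) N \rceil)}$.

3. Use the same sample to calculate $\hat{\mathbf{p}}_t = (\hat{p}_{t,1}, \ldots, \hat{p}_{t,n})$ via

$$\hat{p}_{t,j} = \frac{\sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \hat{\gamma}_t\}}\, I_{\{X_{ij} = 1\}}}{\sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \hat{\gamma}_t\}}}, \qquad (2.10)$$

$j = 1, \ldots, n$, where $\mathbf{X}_i = (X_{i1}, \ldots, X_{in})$.

4. If the stopping criterion is met, then stop; otherwise set $t = t + 1$ and
reiterate from Step 2.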

A possible stopping criterion is to stop when $\hat{\gamma}_t$ does not change for a
number of subsequent iterations. Another possible stopping criterion is to
stop when the vector $\hat{\mathbf{p}}_t$ has converged to a degenerate (that is, binary)
vector. Note that the interpretation of (2.10) is very simple: to update
the $j$-th success probability we count how many vectors of the last sample
$\mathbf{X}_1, \ldots, \mathbf{X}_N$ have a performance greater than or equal to $\hat{\gamma}_t$ and have the $j$-th
coordinate equal to 1, and we divide this by the number of vectors that have
a performance greater than or equal to $\hat{\gamma}_t$.

As an example, consider the case where $\mathbf{y}$ = (1, 1, 1, 1, 1, 0, 0, 0, 0, 0). Using
the initial parameter vector $\hat{\mathbf{p}}_0$ = (1/2, 1/2, . . . , 1/2), and taking $N = 50$ and
$\rho = 0.1$, the algorithm above yields the results given in Table 2.2. We see that
the $\hat{\mathbf{p}}_t$ and $\hat{\gamma}_t$ converge very quickly to the optimal parameter vector $\mathbf{p}^* = \mathbf{y}$
and optimal performance $\gamma^* = n$, respectively.



Table 2.2. Evolution of the sequence $\{(\hat{\gamma}_t, \hat{\mathbf{p}}_t)\}$.

  t   $\hat{\gamma}_t$   $\hat{\mathbf{p}}_t$
  0    -    0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50
  1    7    0.60  0.40  0.80  0.40  1.00  0.00  0.20  0.40  0.00  0.00
  2    9    0.80  0.80  1.00  0.80  1.00  0.00  0.00  0.40  0.00  0.00
  3   10    1.00  1.00  1.00  1.00  1.00  0.00  0.00  0.00  0.00  0.00
  4   10    1.00  1.00  1.00  1.00  1.00  0.00  0.00  0.00  0.00  0.00
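The toy optimization algorithm of this section can be sketched in Python as below. This is an illustrative sketch under our own choices: the function name `ce_bit_recovery`, the seed, and the safeguard `max_iter` are hypothetical, and the stopping rule used is the second one mentioned in the text (stop when the parameter vector is binary). Because no smoothing is applied, a single run may occasionally lock a component at the wrong value, so the output need not reproduce Table 2.2.

```python
import numpy as np

def ce_bit_recovery(y, n_samples=50, rho=0.1, seed=0, max_iter=1000):
    rng = np.random.default_rng(seed)
    n = y.size
    p = np.full(n, 0.5)                          # Step 1: initial success probabilities
    history = []
    for t in range(1, max_iter + 1):
        # Step 2: draw Bernoulli vectors and score them with the oracle S
        x = (rng.random((n_samples, n)) < p).astype(int)
        s = n - np.abs(x - y).sum(axis=1)        # number of matches with y
        k = int(np.ceil((1.0 - rho) * n_samples))
        gamma_t = np.sort(s)[k - 1]              # sample (1 - rho)-quantile
        # Step 3: fraction of ones among samples scoring at least gamma_t
        elite = x[s >= gamma_t]
        p = elite.mean(axis=0)
        history.append((t, int(gamma_t), p.copy()))
        # Step 4: stop once p is a binary (degenerate) vector
        if np.all((p == 0.0) | (p == 1.0)):
            break
    return p, history

y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
p_final, history = ce_bit_recovery(y)
for t, gamma_t, p_t in history:
    print(t, gamma_t, p_t.round(2))
```

Note that the oracle is used only through the scores `s`; the update never looks at `y` directly, which is the point of the example.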







Remark 2.1 (Likelihood ratio term). Note that the randomized optimization
algorithm above is almost the same as the rare-event simulation algorithm in
the previous Section 2.2.1. The most important difference is the absence of
the likelihood ratio term $W$ in Step 3. The reason is that for the optimization
algorithm the choice of the initial parameter $\mathbf{u}$ is quite arbitrary, so using $W$
would be meaningless, while in rare-event simulation it is an essential part of
the estimation problem. For more details see Remark 2.5.



2.3 The CE Method for Rare-Event Simulation 

In this section we discuss the main ideas behind the CE algorithm for rare- 
event simulation. When reading this section, the reader is encouraged to refer 
back to the toy example presented in Section 2.2.1. 

Let $\mathbf{X} = (X_1, \ldots, X_n)$ be a random vector taking values in some space $\mathcal{X}$.
Let $\{f(\cdot; \mathbf{v})\}$ be a family of probability density functions (pdfs) on $\mathcal{X}$, with
respect to some base measure $\mu$, where $\mathbf{v}$ is a real-valued parameter (vector).
Thus,

$$\mathbb{E} H(\mathbf{X}) = \int_{\mathcal{X}} H(\mathbf{x})\, f(\mathbf{x}; \mathbf{v})\, \mu(d\mathbf{x}),$$

for any (measurable) function $H$. In most (or all) applications $\mu$ is either a
counting measure or the Lebesgue measure. For simplicity, for the rest of this
section we take $\mu(d\mathbf{x}) = d\mathbf{x}$.

Let $S$ be some real function on $\mathcal{X}$. Suppose we are interested in the probability
that $S(\mathbf{X})$ is greater than or equal to some real number $\gamma$, which we
will refer to as level, under $f(\cdot; \mathbf{u})$. This probability can be expressed as

$$\ell = \mathbb{P}_{\mathbf{u}}(S(\mathbf{X}) \geq \gamma) = \mathbb{E}_{\mathbf{u}}\, I_{\{S(\mathbf{X}) \geq \gamma\}}.$$

If this probability is very small, say smaller than $10^{-5}$, we call $\{S(\mathbf{X}) \geq \gamma\}$ a
rare event.

A straightforward way to estimate $\ell$ is to use crude Monte-Carlo simulation:
Draw a random sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ from $f(\cdot; \mathbf{u})$; then

$$\frac{1}{N} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \gamma\}}$$

is an unbiased estimator of $\ell$. However, this poses serious problems when
$\{S(\mathbf{X}) \geq \gamma\}$ is a rare event since a large simulation effort is required to estimate
$\ell$ accurately, that is, with a small relative error or a narrow confidence
interval.

An alternative is based on importance sampling: take a random sample
$\mathbf{X}_1, \ldots, \mathbf{X}_N$ from an importance sampling (different) density $g$ on $\mathcal{X}$, and
estimate $\ell$ using the LR estimator (see (2.5))









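$$\hat{\ell} = \frac{1}{N} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \gamma\}}\, \frac{f(\mathbf{X}_i; \mathbf{u})}{g(\mathbf{X}_i)}. \qquad (2.11)$$

The best way to estimate $\ell$ is to use the change of measure with density

$$g^*(\mathbf{x}) = \frac{I_{\{S(\mathbf{x}) \geq \gamma\}}\, f(\mathbf{x}; \mathbf{u})}{\ell}. \qquad (2.12)$$

Namely, by using this change of measure we have in (2.11)

$$I_{\{S(\mathbf{X}_i) \geq \gamma\}}\, \frac{f(\mathbf{X}_i; \mathbf{u})}{g^*(\mathbf{X}_i)} = \ell,$$

for all $i$; see [148]. Since $\ell$ is a constant, the estimator (2.11) has zero variance,
and we need to produce only $N = 1$ sample.

The obvious difficulty is of course that this $g^*$ depends on the unknown
parameter $\ell$. Moreover, it is often convenient to choose a $g$ in the family of
densities $\{f(\cdot; \mathbf{v})\}$. The idea now is to choose the reference parameter (sometimes
called tilting parameter) $\mathbf{v}$ such that the distance between the density $g^*$
above and $f(\cdot; \mathbf{v})$ is minimal. A particularly convenient measure of “distance”
between two densities $g$ and $h$ is the Kullback-Leibler distance, which is also
termed the cross-entropy between $g$ and $h$. The Kullback-Leibler distance is
defined as:

$$\mathcal{D}(g, h) = \mathbb{E}_g \ln \frac{g(\mathbf{X})}{h(\mathbf{X})} = \int g(\mathbf{x}) \ln g(\mathbf{x}) \, d\mathbf{x} - \int g(\mathbf{x}) \ln h(\mathbf{x}) \, d\mathbf{x}. \qquad (2.13)$$

We note that $\mathcal{D}$ is not a “distance” in the formal sense; for example, it is not
symmetric.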
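The lack of symmetry is easy to see numerically. The following small sketch evaluates (2.13) for two discrete pdfs on a two-point space, with the integral replaced by a sum. The pdfs `g` and `h` and the helper name `kl_divergence` are our own illustrative choices.

```python
import math

def kl_divergence(g, h):
    """D(g, h) = sum_x g(x) ln(g(x) / h(x)) for discrete pdfs given as lists.

    Terms with g(x) = 0 contribute nothing; h(x) must be positive wherever
    g(x) is positive.
    """
    return sum(gi * math.log(gi / hi) for gi, hi in zip(g, h) if gi > 0)

g = [0.5, 0.5]
h = [0.9, 0.1]
d_gh = kl_divergence(g, h)   # approximately 0.5108
d_hg = kl_divergence(h, g)   # approximately 0.3681
print(d_gh, d_hg)
```

Both values are nonnegative and vanish only when the two pdfs coincide, but swapping the arguments changes the value.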

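Minimizing the Kullback-Leibler distance between $g^*$ in (2.12) and $f(\cdot; \mathbf{v})$
is equivalent to choosing $\mathbf{v}$ such that $-\int g^*(\mathbf{x}) \ln f(\mathbf{x}; \mathbf{v}) \, d\mathbf{x}$ is minimized,
which is equivalent to solving the maximization problem

$$\max_{\mathbf{v}} \int g^*(\mathbf{x}) \ln f(\mathbf{x}; \mathbf{v}) \, d\mathbf{x}. \qquad (2.14)$$

Substituting $g^*$ from (2.12) into (2.14) we obtain the maximization program

$$\max_{\mathbf{v}} \int \frac{I_{\{S(\mathbf{x}) \geq \gamma\}}\, f(\mathbf{x}; \mathbf{u})}{\ell} \ln f(\mathbf{x}; \mathbf{v}) \, d\mathbf{x}, \qquad (2.15)$$

which is equivalent to the program

$$\max_{\mathbf{v}} D(\mathbf{v}) = \max_{\mathbf{v}} \mathbb{E}_{\mathbf{u}}\, I_{\{S(\mathbf{X}) \geq \gamma\}} \ln f(\mathbf{X}; \mathbf{v}), \qquad (2.16)$$

where $D$ is implicitly defined above. Again using importance sampling, with
a change of measure $f(\cdot; \mathbf{w})$ we can rewrite (2.16) as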






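$$\max_{\mathbf{v}} D(\mathbf{v}) = \max_{\mathbf{v}} \mathbb{E}_{\mathbf{w}}\, I_{\{S(\mathbf{X}) \geq \gamma\}}\, W(\mathbf{X}; \mathbf{u}, \mathbf{w}) \ln f(\mathbf{X}; \mathbf{v}), \qquad (2.17)$$

for any reference parameter $\mathbf{w}$, where

$$W(\mathbf{x}; \mathbf{u}, \mathbf{w}) = \frac{f(\mathbf{x}; \mathbf{u})}{f(\mathbf{x}; \mathbf{w})}$$

is the likelihood ratio, at $\mathbf{x}$, between $f(\cdot; \mathbf{u})$ and $f(\cdot; \mathbf{w})$. The optimal solution
of (2.17) can be written as

$$\mathbf{v}^* = \operatorname*{argmax}_{\mathbf{v}} \mathbb{E}_{\mathbf{w}}\, I_{\{S(\mathbf{X}) \geq \gamma\}}\, W(\mathbf{X}; \mathbf{u}, \mathbf{w}) \ln f(\mathbf{X}; \mathbf{v}). \qquad (2.18)$$

We may estimate $\mathbf{v}^*$ by solving the following stochastic program (also called
stochastic counterpart of (2.17))

$$\max_{\mathbf{v}} \hat{D}(\mathbf{v}) = \max_{\mathbf{v}} \frac{1}{N} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \gamma\}}\, W(\mathbf{X}_i; \mathbf{u}, \mathbf{w}) \ln f(\mathbf{X}_i; \mathbf{v}), \qquad (2.19)$$

where $\mathbf{X}_1, \ldots, \mathbf{X}_N$ is a random sample from $f(\cdot; \mathbf{w})$. In typical applications
the function $\hat{D}$ in (2.19) is convex and differentiable with respect to $\mathbf{v}$ [149],
in which case the solution of (2.19) may be readily obtained by solving (with
respect to $\mathbf{v}$) the following system of equations:

$$\frac{1}{N} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \gamma\}}\, W(\mathbf{X}_i; \mathbf{u}, \mathbf{w})\, \nabla \ln f(\mathbf{X}_i; \mathbf{v}) = \mathbf{0}. \qquad (2.20)$$

The advantage of this approach is that the solution of (2.20) can often be
calculated analytically. In particular, this happens if the distributions of the
random variables belong to a natural exponential family (NEF).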

It is important to note that the CE program (2.19) is useful only if the
probability of the “target event” $\{S(\mathbf{X}) \geq \gamma\}$ is not too small under $\mathbf{w}$,
say greater than $10^{-5}$. For rare-event probabilities, however, the program
(2.19) is difficult to carry out. Namely, due to the rareness of the events
$\{S(\mathbf{X}_i) \geq \gamma\}$, most of the indicator random variables $I_{\{S(\mathbf{X}_i) \geq \gamma\}}$, $i = 1, \ldots, N$
will be zero, for moderate $N$. The same holds for the derivatives of $\hat{D}(\mathbf{v})$
as given in the left-hand side of (2.20). A multilevel algorithm can be used
to overcome this difficulty. The idea is to construct a sequence of reference
parameters $\{\mathbf{v}_t, t \geq 0\}$ and a sequence of levels $\{\gamma_t, t \geq 1\}$, and iterate in
both $\gamma_t$ and $\mathbf{v}_t$ (see Algorithm 2.3.1 below).

We initialize by choosing a not very small $\rho$, say $\rho = 10^{-2}$, and by defining
$\mathbf{v}_0 = \mathbf{u}$. Next, we let $\gamma_1$ ($\gamma_1 < \gamma$) be such that, under the original density
$f(\mathbf{x}; \mathbf{u})$, the probability $\ell_1 = \mathbb{E}_{\mathbf{u}} I_{\{S(\mathbf{X}) \geq \gamma_1\}}$ is at least $\rho$. We then let $\mathbf{v}_1$ be
the optimal CE reference parameter for estimating $\ell_1$, and repeat the last two
steps iteratively with the goal of estimating the pair $\{\ell, \mathbf{v}^*\}$. In other words,
each iteration of the algorithm consists of two main phases. In the first phase
$\gamma_t$ is updated, in the second $\mathbf{v}_t$ is updated. Specifically, starting with $\mathbf{v}_0 = \mathbf{u}$
we obtain the subsequent $\gamma_t$ and $\mathbf{v}_t$ as follows:







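1. Adaptive updating of $\gamma_t$. For a fixed $\mathbf{v}_{t-1}$, let $\gamma_t$ be a $(1-\rho)$-quantile
of $S(\mathbf{X})$ under $\mathbf{v}_{t-1}$. That is, $\gamma_t$ satisfies

$$\mathbb{P}_{\mathbf{v}_{t-1}}(S(\mathbf{X}) \geq \gamma_t) \geq \rho, \qquad (2.21)$$

$$\mathbb{P}_{\mathbf{v}_{t-1}}(S(\mathbf{X}) \leq \gamma_t) \geq 1 - \rho, \qquad (2.22)$$

where $\mathbf{X} \sim f(\cdot; \mathbf{v}_{t-1})$.

A simple estimator $\hat{\gamma}_t$ of $\gamma_t$ can be obtained by drawing a random sample
$\mathbf{X}_1, \ldots, \mathbf{X}_N$ from $f(\cdot; \mathbf{v}_{t-1})$, calculating the performances $S(\mathbf{X}_i)$ for all $i$,
ordering them from smallest to biggest: $S_{(1)} \leq \ldots \leq S_{(N)}$ and finally,
evaluating the sample $(1-\rho)$-quantile as

$$\hat{\gamma}_t = S_{(\lceil (1-\rho) N \rceil)}. \qquad (2.23)$$

Note that $S_{(j)}$ is called the $j$-th order-statistic of the sequence $S(\mathbf{X}_1),
\ldots, S(\mathbf{X}_N)$. Note also that $\hat{\gamma}_t$ is chosen such that the event $\{S(\mathbf{X}) \geq \hat{\gamma}_t\}$
is not too rare (it has a probability of around $\rho$), and therefore updating
the reference parameter via a procedure such as (2.23) is not void of
meaning.

2. Adaptive updating of $\mathbf{v}_t$. For fixed $\gamma_t$ and $\mathbf{v}_{t-1}$, derive $\mathbf{v}_t$ from the
solution of the following CE program

$$\max_{\mathbf{v}} D(\mathbf{v}) = \max_{\mathbf{v}} \mathbb{E}_{\mathbf{v}_{t-1}} I_{\{S(\mathbf{X}) \geq \gamma_t\}}\, W(\mathbf{X}; \mathbf{u}, \mathbf{v}_{t-1}) \ln f(\mathbf{X}; \mathbf{v}). \qquad (2.24)$$

The stochastic counterpart of (2.24) is as follows: for fixed $\hat{\gamma}_t$ and $\hat{\mathbf{v}}_{t-1}$,
derive $\hat{\mathbf{v}}_t$ from the solution of the following program

$$\max_{\mathbf{v}} \hat{D}(\mathbf{v}) = \max_{\mathbf{v}} \frac{1}{N} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \hat{\gamma}_t\}}\, W(\mathbf{X}_i; \mathbf{u}, \hat{\mathbf{v}}_{t-1}) \ln f(\mathbf{X}_i; \mathbf{v}). \qquad (2.25)$$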

Thus, at the first iteration, starting with $\hat{\mathbf{v}}_0 = \mathbf{u}$, to get a good estimate
for $\hat{\mathbf{v}}_1$, the target event is artificially made less rare by (temporarily) using a
level $\hat{\gamma}_1$ which is chosen smaller than $\gamma$. The value of $\hat{\mathbf{v}}_1$ obtained in this way
will (hopefully) make the event $\{S(\mathbf{X}) \geq \gamma\}$ less rare in the next iteration, so
in the next iteration a value $\hat{\gamma}_2$ can be used which is closer to $\gamma$ itself. The
algorithm terminates when at some iteration $t$ a level is reached which is at
least $\gamma$ and thus the original value of $\gamma$ can be used without getting too few
samples.

As mentioned before, the optimal solutions of (2.24) and (2.25) can often
be obtained analytically, in particular when $f(\mathbf{x}; \mathbf{v})$ belongs to a NEF.

The above rationale results in the following algorithm. 







Algorithm 2.3.1 (Main CE Algorithm for Rare-Event Simulation)

1. Define $\hat{\mathbf{v}}_0 = \mathbf{u}$. Set $t = 1$ (iteration = level counter).

2. Generate a sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ from the density $f(\cdot; \hat{\mathbf{v}}_{t-1})$ and compute
the sample $(1-\rho)$-quantile $\hat{\gamma}_t$ of the performances according to (2.23),
provided $\hat{\gamma}_t$ is less than $\gamma$. Otherwise set $\hat{\gamma}_t = \gamma$.

3. Use the same sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ to solve the stochastic program (2.25).
Denote the solution by $\hat{\mathbf{v}}_t$.

4. If $\hat{\gamma}_t < \gamma$, set $t = t + 1$ and reiterate from Step 2. Else proceed with Step 5.

5. Estimate the rare-event probability $\ell$ using the LR estimate

$$\hat{\ell} = \frac{1}{N} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \gamma\}}\, W(\mathbf{X}_i; \mathbf{u}, \hat{\mathbf{v}}_T), \qquad (2.26)$$

where $T$ denotes the final number of iterations (= number of levels used).



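Example 2.2. We return to the example in Section 2.2.1. In this case, from
(2.1) we have

$$\frac{\partial}{\partial v_j} \ln f(\mathbf{x}; \mathbf{v}) = \frac{x_j}{v_j^2} - \frac{1}{v_j},$$

so that the $j$-th equation of (2.20) becomes

$$\sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \gamma\}}\, W(\mathbf{X}_i; \mathbf{u}, \mathbf{w}) \left( \frac{X_{ij}}{v_j^2} - \frac{1}{v_j} \right) = 0, \quad j = 1, \ldots, 5;$$

therefore

$$v_j = \frac{\sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \gamma\}}\, W(\mathbf{X}_i; \mathbf{u}, \mathbf{w})\, X_{ij}}{\sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \gamma\}}\, W(\mathbf{X}_i; \mathbf{u}, \mathbf{w})}, \qquad (2.27)$$

which leads to the updating formula (2.8).

Actually, one can show that if the distributions belong to a NEF that is
parameterized by the mean, the updating formula always becomes (2.27).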
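
As a further illustration of the same calculation, consider the Bernoulli model of Section 2.2.2, which is also parameterized by its mean. The following derivation is our own sketch, with $W = 1$ as in the optimization setting and with $\gamma$ standing for the current level. For $f(\mathbf{x}; \mathbf{p}) = \prod_{j=1}^{n} p_j^{x_j} (1 - p_j)^{1 - x_j}$ we have

$$\frac{\partial}{\partial p_j} \ln f(\mathbf{x}; \mathbf{p}) = \frac{x_j}{p_j} - \frac{1 - x_j}{1 - p_j} = \frac{x_j - p_j}{p_j (1 - p_j)},$$

so that the $j$-th equation of (2.20) with $W = 1$ reads

$$\sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \gamma\}}\, (X_{ij} - p_j) = 0, \quad \text{that is,} \quad p_j = \frac{\sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \gamma\}}\, X_{ij}}{\sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \gamma\}}}.$$

This is the counting rule used in Step 3 of the algorithm in Section 2.2.2: the fraction of sufficiently good samples whose $j$-th coordinate equals 1.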



Remark 2.3 (Static Simulation). The above method has been formulated for 
finite-dimensional random vectors only; this is sometimes referred to as static 
simulation. For infinite-dimensional random vectors or stochastic processes 
we need a more subtle treatment. We will not go into details here, but rather 
refer to [46, 15] and Chapter 3. The main point is that Algorithm 2.3.1 holds 
true without much alteration. 




2.4 The CE-Method for Combinatorial Optimization 




In this section we discuss the main ideas behind the CE algorithm for combi- 
natorial optimization. When reading this section, the reader is encouraged to 
refer to the toy example in Section 2.2.2. 

Consider the following general maximization problem: Let $\mathcal{X}$ be a finite
set of states, and let $S$ be a real-valued performance function on $\mathcal{X}$. We wish
to find the maximum of $S$ over $\mathcal{X}$ and the corresponding state(s) at which
this maximum is attained. Let us denote the maximum by $\gamma^*$. Thus,

$$S(\mathbf{x}^*) = \gamma^* = \max_{\mathbf{x} \in \mathcal{X}} S(\mathbf{x}). \qquad (2.28)$$

The starting point in the methodology of the CE method is to associate
with the optimization problem (2.28) a meaningful estimation problem. To this
end we define a collection of indicator functions $\{I_{\{S(\mathbf{x}) \geq \gamma\}}\}$ on $\mathcal{X}$ for various
levels $\gamma \in \mathbb{R}$. Next, let $\{f(\cdot; \mathbf{v}), \mathbf{v} \in \mathcal{V}\}$ be a family of (discrete) probability
densities on $\mathcal{X}$, parameterized by a real-valued parameter (vector) $\mathbf{v}$. For a
certain $\mathbf{u} \in \mathcal{V}$ we associate with (2.28) the problem of estimating the number

$$\ell(\gamma) = \mathbb{P}_{\mathbf{u}}(S(\mathbf{X}) \geq \gamma) = \sum_{\mathbf{x}} I_{\{S(\mathbf{x}) \geq \gamma\}}\, f(\mathbf{x}; \mathbf{u}) = \mathbb{E}_{\mathbf{u}}\, I_{\{S(\mathbf{X}) \geq \gamma\}}, \qquad (2.29)$$

where $\mathbb{P}_{\mathbf{u}}$ is the probability measure under which the random state $\mathbf{X}$ has pdf
$f(\cdot; \mathbf{u})$, and $\mathbb{E}_{\mathbf{u}}$ denotes the corresponding expectation operator. We will call
the estimation problem (2.29) the associated stochastic problem (ASP). To
indicate how (2.29) is associated with (2.28), suppose for example that $\gamma$ is
equal to $\gamma^*$ and that $f(\cdot; \mathbf{u})$ is the uniform density on $\mathcal{X}$. Note that, typically,
$\ell(\gamma^*) = f(\mathbf{x}^*; \mathbf{u}) = 1/|\mathcal{X}|$, where $|\mathcal{X}|$ denotes the number of elements in $\mathcal{X}$,
is a very small number. Thus, for $\gamma = \gamma^*$ a natural way to estimate $\ell(\gamma)$
would be to use the LR estimator (2.26) with reference parameter $\mathbf{v}^*$ given
by

$$\mathbf{v}^* = \operatorname*{argmax}_{\mathbf{v}} \mathbb{E}_{\mathbf{u}}\, I_{\{S(\mathbf{X}) \geq \gamma\}} \ln f(\mathbf{X}; \mathbf{v}). \qquad (2.30)$$

This parameter could be estimated by

$$\hat{\mathbf{v}}^* = \operatorname*{argmax}_{\mathbf{v}} \frac{1}{N} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \gamma\}} \ln f(\mathbf{X}_i; \mathbf{v}), \qquad (2.31)$$

where the $\mathbf{X}_i$ are generated from pdf $f(\cdot; \mathbf{u})$. It is plausible that, if $\gamma$ is close
to $\gamma^*$, $f(\cdot; \hat{\mathbf{v}}^*)$ assigns most of its probability mass close to $\mathbf{x}^*$, and thus
can be used to generate an approximate solution to (2.28). However, it is
important to note that the estimator (2.31) is only of practical use when
$I_{\{S(\mathbf{X}_i) \geq \gamma\}} = 1$ for enough samples. This means for example that when $\gamma$ is
close to $\gamma^*$, $\mathbf{u}$ needs to be such that $\mathbb{P}_{\mathbf{u}}(S(\mathbf{X}) \geq \gamma)$ is not too small. Thus,
the choice of $\mathbf{u}$ and $\gamma$ in (2.28) are closely related. On the one hand we would
like to choose $\gamma$ as close as possible to $\gamma^*$, and find (an estimate of) $\mathbf{v}^*$ via






the procedure above, which assigns almost all mass to state(s) close to the
optimal state. On the other hand, we would like to keep $\ell(\gamma)$ relatively large
(that is, $\gamma$ not too close to $\gamma^*$) in order to obtain an accurate (low RE)
estimator for $\mathbf{v}^*$.

The situation is very similar to the rare-event simulation case of Section 2.3.
The idea, based essentially on Algorithm 2.3.1, is to adopt a two-phase
multilevel approach in which we simultaneously construct a sequence
of levels $\hat\gamma_1, \hat\gamma_2, \ldots, \hat\gamma_T$ and parameter (vectors) $\hat{\mathbf{v}}_0, \hat{\mathbf{v}}_1, \ldots, \hat{\mathbf{v}}_T$ such that $\hat\gamma_T$ is
close to the optimal $\gamma^*$ and $\hat{\mathbf{v}}_T$ is such that the corresponding density assigns
high probability mass to the collection of states that give a high performance.
This strategy is embodied in the following procedure; see for example [145]:

Algorithm 2.4.1 (Main CE Algorithm for Optimization)

1. Define $\hat{\mathbf{v}}_0 = \mathbf{u}$. Set $t = 1$ (level counter).

2. Generate a sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ from the density $f(\cdot;\hat{\mathbf{v}}_{t-1})$ and compute
the sample $(1 - \varrho)$-quantile $\hat\gamma_t$ of the performances according to (2.23).

3. Use the same sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ and solve the stochastic program (2.25)
with $W = 1$. Denote the solution by $\hat{\mathbf{v}}_t$.

4. If for some $t \ge d$, say $d = 5$,

$$\hat\gamma_t = \hat\gamma_{t-1} = \cdots = \hat\gamma_{t-d} , \tag{2.32}$$

then stop (let $T$ denote the final iteration); otherwise set $t = t + 1$ and
reiterate from Step 2.
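The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the Bernoulli family of Example 2.6 (so that Step 3 reduces to averaging the elite samples), the function name `ce_maximize`, the `max_iter` safeguard and the toy score used in the usage lines are all hypothetical, and no smoothing is applied.

```python
import numpy as np

def ce_maximize(score, n, N=200, rho=0.1, d=5, max_iter=200, seed=0):
    """Sketch of Algorithm 2.4.1 for X ~ Ber(p) on {0,1}^n (maximization)."""
    rng = np.random.default_rng(seed)
    p = np.full(n, 0.5)                  # Step 1: v_0 = u (uniform on {0,1}^n)
    gammas = []
    best_x, best_s = None, -np.inf
    for t in range(1, max_iter + 1):
        # Step 2: sample from f(.; v_{t-1}) and take the sample (1 - rho)-quantile.
        X = (rng.random((N, n)) < p).astype(int)
        S = np.array([score(x) for x in X])
        gamma = np.sort(S)[int(np.ceil((1 - rho) * N)) - 1]
        gammas.append(gamma)
        # Step 3: with W = 1 and a Bernoulli family, the solution of the
        # stochastic program is the mean of the elite samples.
        p = X[S >= gamma].mean(axis=0)
        # Keep track of the best sample seen so far (bookkeeping only).
        i = int(np.argmax(S))
        if S[i] > best_s:
            best_x, best_s = X[i].copy(), S[i]
        # Step 4: stop once the level has been constant for d + 1 iterations.
        if t > d and len(set(gammas[-(d + 1):])) == 1:
            break
    return p, gammas, best_x, best_s

# Hypothetical toy score: number of positions that agree with a hidden vector y.
y = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
p, gammas, best_x, best_s = ce_maximize(lambda x: int(np.sum(x == y)), n=10)
```

Because the elite set is defined through a sample quantile, it is never empty, which is why the mean in Step 3 is always well defined in this sketch.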



Note that the initial vector $\hat{\mathbf{v}}_0$, the sample size $N$, the stopping parameter $d$,
and the number $\varrho$ have to be specified in advance, but that the rest of the
algorithm is “self-tuning.”

Remark 2.4 (Smoothed Updating). Instead of updating the parameter vector
$\hat{\mathbf{v}}_{t-1}$ to $\hat{\mathbf{v}}_t$ directly via (2.31) we use a smoothed updating procedure in which

$$\hat{\mathbf{v}}_t = \alpha\, \tilde{\mathbf{w}}_t + (1 - \alpha)\, \hat{\mathbf{v}}_{t-1} , \tag{2.33}$$

where $\tilde{\mathbf{w}}_t$ is the vector derived via (2.25) with $W = 1$. This is especially
relevant for optimization problems involving discrete random variables. The
main reason why this heuristic smoothed updating procedure performs better
is that it prevents the occurrence of 0s and 1s in the parameter vectors; once
such an entry is 0 or 1, it often will remain so forever, which is undesirable. We
found empirically that a value of $\alpha$ in the range $0.3 \le \alpha \le 0.9$ gives good results.
Clearly for $\alpha = 1$ we have the original updating rule in Algorithm 2.4.1.
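A minimal numerical illustration of the smoothed update (2.33) follows. The helper name `smoothed_update` and the example vectors are assumptions made for illustration only.

```python
import numpy as np

def smoothed_update(v_prev, w_new, alpha=0.7):
    """Return alpha * w_new + (1 - alpha) * v_prev, as in (2.33)."""
    v_prev = np.asarray(v_prev, dtype=float)
    w_new = np.asarray(w_new, dtype=float)
    return alpha * w_new + (1.0 - alpha) * v_prev

v_prev = np.array([0.5, 0.5, 0.5])
w_new = np.array([1.0, 0.0, 0.6])   # unsmoothed solution with degenerate entries
v = smoothed_update(v_prev, w_new, alpha=0.7)
```

With these inputs the degenerate entries 1 and 0 become 0.85 and 0.15, so the corresponding components can still change in later iterations.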

In many applications we observed numerically that the sequence of pdfs
$f(\cdot;\hat{\mathbf{v}}_0), f(\cdot;\hat{\mathbf{v}}_1), \ldots$ converges to a degenerate measure (Dirac measure),
assigning all probability mass to a single state $\mathbf{x}_T$, for which, by definition, the
function value is greater than or equal to $\hat\gamma_T$.







Remark 2.5 (Similarities and differences). Despite the great similarity
between Algorithm 2.3.1 and Algorithm 2.4.1, it is important to note that the
role of the initial reference parameter $\mathbf{u}$ is significantly different. In
Algorithm 2.3.1 $\mathbf{u}$ is the unique nominal parameter for estimating $P_{\mathbf{u}}(S(\mathbf{X}) \ge \gamma)$.
However, in Algorithm 2.4.1 the choice for the initial parameter $\mathbf{u}$ is fairly
arbitrary; it is only used to define the ASP. In contrast to Algorithm 2.3.1
the ASP for Algorithm 2.4.1 is redefined after each iteration. In particular, in
Steps 2 and 3 of Algorithm 2.4.1 we determine the optimal reference parameter
associated with $P_{\hat{\mathbf{v}}_{t-1}}(S(\mathbf{X}) \ge \hat\gamma_t)$, instead of $P_{\mathbf{u}}(S(\mathbf{X}) \ge \hat\gamma_t)$. Consequently,
the likelihood ratio term $W$ that plays a crucial role in Algorithm 2.3.1 does
not appear in Algorithm 2.4.1.

The above procedure can, in principle, be applied to any discrete and 
continuous optimization problem. For each individual problem two essential 
ingredients need to be supplied: 

1. We need to specify how the samples are generated. In other words, we 
need to specify the family of densities {/(•; v)}. 

2. We need to calculate the updating rules for the parameters, based on 
cross-entropy minimization. 

In general there are many ways to generate samples from $\mathcal{X}$, and it is not
always immediately clear which way of generating the sample will yield better
results or easier updating formulas.

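Example 2.6. We return to the example from Section 2.2.2. In this case, the
random vector $\mathbf{X} = (X_1, \ldots, X_n) \sim \mathrm{Ber}(\mathbf{p})$, and the parameter vector $\mathbf{v}$ is $\mathbf{p}$.
Consequently, the pdf is

$$f(\mathbf{X};\mathbf{p}) = \prod_{i=1}^{n} p_i^{X_i} (1 - p_i)^{1 - X_i} ,$$

and since each $X_j$ can only be 0 or 1,

$$\frac{\partial}{\partial p_j} \ln f(\mathbf{X};\mathbf{p}) = \frac{X_j}{p_j} - \frac{1 - X_j}{1 - p_j} = \frac{1}{(1 - p_j)\, p_j} (X_j - p_j) .$$

Now we can find the maximum in (2.25) (with $W = 1$) by setting the first
derivatives with respect to $p_j$ equal to zero, for $j = 1, \ldots, n$:

$$\frac{\partial}{\partial p_j} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \ge \gamma\}} \ln f(\mathbf{X}_i;\mathbf{p}) = \frac{1}{(1 - p_j)\, p_j} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \ge \gamma\}} (X_{ij} - p_j) = 0 .$$

Thus, we get

$$p_j = \frac{\sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \ge \gamma\}} X_{ij}}{\sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \ge \gamma\}}} , \tag{2.34}$$

which immediately implies (2.10).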







A number of remarks are now in order. 

Remark 2.7 (Single-Phase Versus Two-Phase Approach). Algorithm 2.4.1 is a
two-phase algorithm. That is, at each iteration $t$ both the reference parameter
$\hat{\mathbf{v}}_t$ and the level parameter $\hat\gamma_t$ are updated. It is not difficult, using the same
rationale as before, to formulate a single-phase CE algorithm. In particular,
consider maximization problem (2.28). Let $\varphi$ be any increasing “reward”
function of the performances. Let $\{f(\cdot;\mathbf{v})\}$ be a family of densities on $\mathcal{X}$ which
contains the Dirac measure at $\mathbf{x}^*$. Then, solving problem (2.28) is equivalent
to solving

$$\max_{\mathbf{v}} E_{\mathbf{v}}\, \varphi(S(\mathbf{X})) ,$$

or solving

$$\max_{\mathbf{v}} E_{\mathbf{u}}\, \varphi(S(\mathbf{X})) \ln f(\mathbf{X};\mathbf{v}) ,$$

for any $\mathbf{u}$. As before we construct a sequence of parameter vectors
$\hat{\mathbf{v}}_0 = \mathbf{u}, \hat{\mathbf{v}}_1, \hat{\mathbf{v}}_2, \ldots$, such that

$$\hat{\mathbf{v}}_t = \operatorname*{argmax}_{\mathbf{v}} \sum_{i=1}^{N} \varphi(S(\mathbf{X}_i)) \ln f(\mathbf{X}_i;\mathbf{v}) ,$$

where $\mathbf{X}_1, \ldots, \mathbf{X}_N$ is a random sample from $f(\cdot;\hat{\mathbf{v}}_{t-1})$. A reward function
without a level parameter $\gamma$ would simplify Algorithm 2.4.1 substantially.
The disadvantage of this approach is that, typically, the resulting algorithm
takes too long to converge, since the large number of “not important”
vectors dramatically slows down the convergence of $\{\hat{\mathbf{v}}_t\}$ to the $\mathbf{v}^*$ corresponding
to the Dirac measure at $\mathbf{x}^*$. We found numerically that the single-phase CE
algorithm is much worse than its two-phase counterpart in Algorithm 2.4.1, in
the sense that it is less accurate and more time consuming. Hence it is crucial
for the CE method to use both phases, that is, follow Algorithm 2.4.1. This is
also one of the major differences between CE and ant-based methods, where a
single-phase procedure (updating of $\hat{\mathbf{v}}_t$ alone, no updating of $\hat\gamma_t$) is used [49].

Remark 2.8 (Maximum Likelihood Estimation). It is interesting to note the
connection between (2.25) and maximum likelihood estimation (MLE). In the
MLE problem we are given data $\mathbf{x}_1, \ldots, \mathbf{x}_N$ which are thought to be the
outcomes of i.i.d. random variables $\mathbf{X}_1, \ldots, \mathbf{X}_N$ (random sample) each having
a distribution $f(\cdot;\mathbf{v})$, where the parameter (vector) $\mathbf{v}$ is an element of some set
$\mathcal{V}$. We wish to estimate $\mathbf{v}$ on the basis of the data $\mathbf{x}_1, \ldots, \mathbf{x}_N$. The maximum
likelihood estimate (MLE) is that parameter $\hat{\mathbf{v}}$ which maximizes the joint
density of $\mathbf{X}_1, \ldots, \mathbf{X}_N$ for the given data $\mathbf{x}_1, \ldots, \mathbf{x}_N$. In other words,

$$\hat{\mathbf{v}} = \operatorname*{argmax}_{\mathbf{v}} \prod_{i=1}^{N} f(\mathbf{x}_i;\mathbf{v}) .$$

The corresponding random variable, obtained by replacing $\mathbf{x}_i$ with $\mathbf{X}_i$, is called
the maximum likelihood estimator (MLE as well), also denoted by $\hat{\mathbf{v}}$. Since
$\ln(\cdot)$ is an increasing function, we have







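$$\hat{\mathbf{v}} = \operatorname*{argmax}_{\mathbf{v}} \sum_{i=1}^{N} \ln f(\mathbf{X}_i;\mathbf{v}) . \tag{2.35}$$

Solving (2.35) is similar to solving (2.25). The only differences are the indicator
function $I_{\{S(\mathbf{X}_i) \ge \gamma\}}$ and the likelihood ratio $W$. For $W = 1$ we can write Step
3 in Algorithm 2.4.1 as

$$\hat{\mathbf{v}}_t = \operatorname*{argmax}_{\mathbf{v}} \sum_{\mathbf{X}_i :\, S(\mathbf{X}_i) \ge \hat\gamma_t} \ln f(\mathbf{X}_i;\mathbf{v}) .$$

In other words, $\hat{\mathbf{v}}_t$ is equal to the MLE of $\hat{\mathbf{v}}_{t-1}$ based only on the vectors $\mathbf{X}_i$
in the random sample for which the performance is greater than or equal to
$\hat\gamma_t$. For example, in Example 2.6 the MLE of $p_j$ based on a random sample
$\mathbf{X}_1, \ldots, \mathbf{X}_N$ is

$$\hat{p}_j = \frac{\sum_{i=1}^{N} X_{ij}}{N} .$$

Thus, if we base the MLE only on those vectors that have performance greater
than or equal to $\gamma$, we obtain (2.34) immediately.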



Remark 2.9 (Parameters). The choice for the sample size $N$ and the parameter
$\varrho$ depends on the size of the problem and the number of parameters in the
ASP. In particular, for a SNN-type problem it is suggested to take the sample
size as $N = cn$, where $n$ is the number of nodes and $c$ a constant ($c > 1$),
say $5 \le c \le 10$. In contrast, for a SEN-type problem it is suggested to take
$N = cn^2$, where $n^2$ is the number of edges in the network. It is crucial to
realize that the sample sizes $N = cn$ and $N = cn^2$ (with $c > 1$) are associated
with the number of ASP parameters ($n$ and $n^2$) that one needs to estimate
for the SNN and SEN problems, respectively (see also the max-cut and the
TSP examples below). Clearly, in order to estimate $k$ parameters reliably, one
needs to take a sample of size at least $N = ck$ for some constant $c > 1$. Regarding
$\varrho$, it is suggested to take $\varrho$ around 0.01, provided $n$ is reasonably large, say
$n \ge 100$; and it is suggested to take a larger $\varrho$, say $\varrho \approx \ln(n)/n$, for $n < 100$.

Alternatively, the choice of $N$ and $\varrho$ can be determined adaptively. For
example, in [85] an adaptive algorithm is proposed that adjusts $N$ automatically.
The FACE algorithm discussed in Chapter 5 is another option.
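The rules of thumb in Remark 2.9 can be collected in a small helper. This is a sketch only: the function name `suggested_parameters`, the string labels "SNN" and "SEN" as arguments, and the treatment of $n = 100$ as "large" are assumptions, and the suggestions are starting points rather than requirements.

```python
import math

def suggested_parameters(n, problem_type, c=10):
    """Heuristic sample size N and rarity parameter rho from Remark 2.9.

    c is the user-chosen constant (the remark suggests roughly 5 to 10).
    """
    if problem_type == "SNN":      # n parameters, e.g. max-cut
        N = c * n
    elif problem_type == "SEN":    # n^2 parameters, e.g. TSP
        N = c * n * n
    else:
        raise ValueError("problem_type must be 'SNN' or 'SEN'")
    rho = 0.01 if n >= 100 else math.log(n) / n
    return N, rho

N_tsp, rho_tsp = suggested_parameters(53, "SEN", c=10)
N_cut, rho_cut = suggested_parameters(400, "SNN", c=5)
```

For a 53-node SEN-type problem with $c = 10$ this gives $N = 10 \cdot 53^2 = 28090$.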



2.5 Two Applications 

In this section we discuss the application of the CE method to two combina- 
torial optimization problems, namely the max-cut problem, which is a typical 
SNN-type problem; and the travelling salesman problem, which is a typical 
SEN-type problem. We demonstrate the usefulness of the CE method and 
its fast convergence in a number of numerical examples. We further illustrate 







the dynamics of the CE method and show how fast the reference parameters
converge. For a more in-depth study of the max-cut problem and the
related partition problem we refer to Sections 4.5 and 4.6. Similarly, the TSP
is discussed in more detail in Section 4.7.



2.5.1 The Max-Cut Problem

The max-cut problem in a graph can be formulated as follows. Given a
weighted graph $G(V, E)$ with node set $V = \{1, \ldots, n\}$ and edge set $E$,
partition the nodes of the graph into two subsets $V_1$ and $V_2$ such that the sum
of the weights of the edges going from one subset to the other is maximized.
We assume the weights are nonnegative. We note that the max-cut problem is
an NP-hard problem. Without loss of generality, we assume that the graph is
complete. For simplicity we assume the graph is not directed. We can
represent the possibly zero edge weights via a nonnegative, symmetric cost matrix
$C = (c_{ij})$ where $c_{ij}$ denotes the weight of the edge from $i$ to $j$.

Formally, a cut is a partition $\{V_1, V_2\}$ of $V$. For example if $V = \{1, \ldots, 6\}$,
then $\{\{1, 3, 4\}, \{2, 5, 6\}\}$ is a possible cut. The cost of a cut is the sum of the
weights across the cut. As an example, consider the following $6 \times 6$ cost matrix

$$C = \begin{pmatrix}
0 & c_{12} & c_{13} & 0 & 0 & 0 \\
c_{21} & 0 & c_{23} & c_{24} & 0 & 0 \\
c_{31} & c_{32} & 0 & c_{34} & c_{35} & 0 \\
0 & c_{42} & c_{43} & 0 & c_{45} & c_{46} \\
0 & 0 & c_{53} & c_{54} & 0 & c_{56} \\
0 & 0 & 0 & c_{64} & c_{65} & 0
\end{pmatrix} \tag{2.36}$$

For example the cost of the cut $\{\{1, 3, 4\}, \{2, 5, 6\}\}$ is

$$c_{12} + c_{32} + c_{42} + c_{45} + c_{53} + c_{46} .$$

It will be convenient to represent a cut via a cut vector $\mathbf{x} = (x_1, \ldots, x_n)$,
where $x_i = 1$ if node $i$ belongs to the same partition as node 1, and 0 otherwise. By
definition $x_1 = 1$. For example, the cut $\{\{1, 3, 4\}, \{2, 5, 6\}\}$ can be represented
via the cut vector $(1, 0, 1, 1, 0, 0)$.
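To make the cut-vector representation concrete, the following sketch evaluates the cost of a cut for a matrix with the sparsity pattern of (2.36). The numeric weights and the function name `cut_cost` are hypothetical and chosen only so that the result can be checked by hand.

```python
import numpy as np

def cut_cost(C, x):
    """Sum of the weights c_ij over all pairs with x_i = 1 and x_j = 0."""
    C = np.asarray(C, dtype=float)
    x = np.asarray(x, dtype=bool)
    return float(C[np.ix_(x, ~x)].sum())

# Hypothetical weights for the nonzero entries of (2.36), keyed by 1-based (i, j).
weights = {(1, 2): 1, (1, 3): 2, (2, 3): 3, (2, 4): 4, (3, 4): 5,
           (3, 5): 6, (4, 5): 7, (4, 6): 8, (5, 6): 9}
C = np.zeros((6, 6))
for (i, j), w in weights.items():
    C[i - 1, j - 1] = C[j - 1, i - 1] = w   # symmetric cost matrix

cost = cut_cost(C, [1, 0, 1, 1, 0, 0])      # the cut {{1,3,4},{2,5,6}}
```

Here the cost is $c_{12} + c_{32} + c_{42} + c_{45} + c_{53} + c_{46} = 1 + 3 + 4 + 7 + 6 + 8 = 29$, in agreement with the expression above.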

Let $\mathcal{X}$ be the set of all cut vectors $\mathbf{x} = (1, x_2, \ldots, x_n)$ and let $S(\mathbf{x})$ be the
corresponding cost of the cut. We wish to maximize $S$ via the CE method.
Thus, (a) we need to specify how the random cut vectors are generated, and
(b) calculate the corresponding updating formulas. The most natural and
easiest way to generate the cut vectors is to let $X_2, \ldots, X_n$ be independent
Bernoulli random variables with success probabilities $p_2, \ldots, p_n$, exactly as in
the second toy example; see Section 2.2.2.







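It immediately follows (see Example 2.6) that the updating formula in
Algorithm 2.4.1 at the $t$-th iteration is given by

$$\hat{p}_{t,i} = \frac{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \ge \hat\gamma_t\}}\, I_{\{X_{ki} = 1\}}}{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \ge \hat\gamma_t\}}} , \qquad i = 2, \ldots, n. \tag{2.37}$$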
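The update (2.37) amounts to averaging the elite cut vectors column by column. The sketch below shows this on a tiny hand-checkable sample; the function name `maxcut_update` and the sample values are hypothetical, and the code assumes that at least one sample reaches the level (which holds when the level is a sample quantile).

```python
import numpy as np

def maxcut_update(X, S, gamma):
    """Formula (2.37): per node, the fraction of elite cut vectors with x_i = 1."""
    X = np.asarray(X)
    S = np.asarray(S)
    elite = S >= gamma                 # indicator I{S(X_k) >= gamma_t}
    p = X[elite].mean(axis=0)
    p[0] = 1.0                         # node 1 always lies in the first subset
    return p

X = [[1, 0, 1],
     [1, 1, 0],
     [1, 1, 1],
     [1, 0, 0]]                        # four sampled cut vectors for n = 3
S = [5, 7, 9, 2]                       # their (hypothetical) cut values
p = maxcut_update(X, S, gamma=7)
```

Only the second and third samples are elite, so the updated vector is $(1, 1, 0.5)$.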

A Synthetic Max-Cut Problem

Since the max-cut problem is NP-hard [57, 125], no efficient (polynomial-time)
exact method for solving it is known. The naive total enumeration routine is only
feasible for small graphs, say for those with $n \le 30$ nodes. Although the
branch-and-bound heuristic can solve medium size problems exactly, it too
will run into problems when $n$ becomes large.

In order to verify the accuracy of the CE method we construct an artificial
graph such that the solution is available in advance. In particular, for $m \in
\{1, \ldots, n\}$ consider the following symmetric cost matrix:

$$C = \begin{pmatrix} Z_{11} & B_{12} \\ B_{21} & Z_{22} \end{pmatrix} , \tag{2.38}$$

where $Z_{11}$ is an $m \times m$ symmetric matrix in which all the upper-diagonal
elements are generated from a $\mathsf{U}(a, b)$ distribution (and all lower-diagonal
elements follow by symmetry), $Z_{22}$ is an $(n - m) \times (n - m)$ symmetric matrix
which is generated in a similar way as $Z_{11}$, and all the other elements are $c$,
apart from the diagonal elements, which are of course 0.

It is not difficult to see that for $c > b(n - m)/m$ the optimal cut is given,
by construction, by $V^* = \{V_1^*, V_2^*\}$ with

$$V_1^* = \{1, \ldots, m\} \quad \text{and} \quad V_2^* = \{m + 1, \ldots, n\} , \tag{2.39}$$

and the optimal value of the cut is

$$\gamma^* = c\, m\, (n - m) . \tag{2.40}$$

Of course a similar optimal solution and optimal value can be found for the
general case where the elements in $Z_{11}$ and $Z_{22}$ are generated via an arbitrary
bounded support distribution with the maximal value of the support less than
$c$.
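The construction can be checked numerically on a small instance. The sketch below builds such a block matrix and compares the planted cut value $cm(n-m)$ with a brute-force maximum over all cut vectors. The function name `synthetic_maxcut_matrix`, the seed and the small sizes ($n = 12$, $m = 6$, $c = 2$, weights from $\mathsf{U}(0,1)$) are assumptions chosen so that full enumeration is cheap.

```python
import itertools
import numpy as np

def synthetic_maxcut_matrix(n, m, c=1.0, a=0.0, b=1.0, seed=0):
    """Symmetric cost matrix with U(a, b) diagonal blocks and constant c elsewhere."""
    rng = np.random.default_rng(seed)
    C = np.full((n, n), float(c))
    for lo, hi in ((0, m), (m, n)):
        k = hi - lo
        Z = np.triu(rng.uniform(a, b, size=(k, k)), 1)   # strictly upper triangle
        C[lo:hi, lo:hi] = Z + Z.T                         # symmetric, zero diagonal
    return C

n, m, c = 12, 6, 2.0
C = synthetic_maxcut_matrix(n, m, c)

# Value of the planted cut V1* = {1..m}, V2* = {m+1..n}.
x_star = np.r_[np.ones(m, dtype=bool), np.zeros(n - m, dtype=bool)]
planted_value = C[np.ix_(x_star, ~x_star)].sum()

# Brute force over all cut vectors with x_1 = 1 (2^(n-1) candidates).
best_value = max(
    C[np.ix_(x, ~x)].sum()
    for bits in itertools.product([True, False], repeat=n - 1)
    for x in [np.array((True,) + bits)]
)
```

In this instance $c = 2 > b(n-m)/m = 1$, and both `planted_value` and `best_value` equal $c\,m\,(n-m) = 72$.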

Table 2.3 lists a typical output of Algorithm 2.4.1 applied to the synthetic
max-cut problem, for a network with $n = 400$ nodes. The convergence of the
reference vectors $\{\hat{\mathbf{p}}_t\}$ to the optimal $\mathbf{p}^*$ is further illustrated in Figures 2.3
and 2.4.







In this table, besides the $(1 - \varrho)$-quantile $\hat\gamma_t$ of the performances, we also
list the best of the performances in each iteration, denoted by $S_{t,(N)}$, and the
Euclidean distance

$$\|\hat{\mathbf{p}}_t - \mathbf{p}^*\| = \sqrt{\sum_{i} \left(\hat{p}_{t,i} - p_i^*\right)^2} ,$$

as a measure of how close the reference vector is to the optimal reference
vector $\mathbf{p}^* = (1, 1, \ldots, 1, 0, 0, \ldots, 0)$.

In this particular example, we took $m = 200$ and generated $Z_{11}$ and $Z_{22}$
from the $\mathsf{U}(0, 1)$ distribution; the elements in $B_{12}$ (and $B_{21}$) are constant,
$c = 1$. The CE parameters were chosen as follows: rarity parameter $\varrho = 0.1$;
smoothing parameter $\alpha = 1.0$ (no smoothing); stopping constant $d = 3$; and
number of samples per iteration $N = 1000$.

The CPU time was only 100 seconds, using a Matlab implementation on
a Pentium III, 500 MHz processor. We see that the CE algorithm converges
quickly, yielding the exact optimal solution 40000 in fewer than 23 iterations.



Table 2.3. A typical evolution of Algorithm 2.4.1 for the synthetic max-cut problem
with $n = 400$, $d = 3$, $\varrho = 0.1$, $\alpha = 1.0$, $N = 1000$.

  t    $\hat\gamma_t$    $S_{t,(N)}$    $\|\hat{\mathbf{p}}_t - \mathbf{p}^*\|$
  1    30085.3    30320.9     9.98
  2    30091.3    30369.4    10.00
  3    30113.7    30569.8     9.98
  4    30159.2    30569.8     9.71
  5    30350.8    30652.9     9.08
  6    30693.6    31244.7     8.37
  7    31145.1    31954.8     7.65
  8    31711.8    32361.5     6.94
  9    32366.4    33050.3     6.27
 10    33057.8    33939.9     5.58
 11    33898.6    34897.9     4.93
 12    34718.9    35876.4     4.23
 13    35597.1    36733.0     3.60
 14    36368.9    37431.7     3.02
 15    37210.5    38051.2     2.48
 16    37996.7    38654.5     1.96
 17    38658.8    39221.9     1.42
 18    39217.1    39707.8     1.01
 19    39618.3    40000.0     0.62
 20    39904.5    40000.0     0.29
 21    40000.0    40000.0     0.14
 22    40000.0    40000.0     0.00
 23    40000.0    40000.0     0.00







2.5.2 The Travelling Salesman Problem 

The travelling salesman problem (TSP) can be formulated as follows: Consider
a weighted graph $G$ with $n$ nodes, labeled $1, 2, \ldots, n$. The nodes represent
cities, and the edges represent the roads between the cities. Each edge from $i$
to $j$ has weight or cost $c_{ij}$, representing the length of the road. The problem
is to find the shortest tour that visits all the cities exactly once^ (except the
starting city, which is also the terminating city); see Figure 2.5.




Fig. 2.5. Find the shortest tour x visiting all nodes. 

Without loss of generality, let us assume that the graph is complete, and
that some of the weights may be $+\infty$. Let $\mathcal{X}$ be the set of all possible tours
and let $S(\mathbf{x})$ be the total length of tour $\mathbf{x} \in \mathcal{X}$. We can represent each tour via a
permutation of $(1, \ldots, n)$. For example for $n = 4$, the permutation $(1, 3, 2, 4)$
represents the tour $1 \to 3 \to 2 \to 4 \to 1$. In fact, we may as well represent
a tour via a permutation $\mathbf{x} = (x_1, \ldots, x_n)$ with $x_1 = 1$. From now on we
identify a tour with its corresponding permutation, where $x_1 = 1$. We may
now formulate the TSP as follows.



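$$\min_{\mathbf{x} \in \mathcal{X}} S(\mathbf{x}) = \min_{\mathbf{x} \in \mathcal{X}} \left\{ \sum_{i=1}^{n-1} c_{x_i, x_{i+1}} + c_{x_n, 1} \right\} . \tag{2.41}$$

Note that the number of elements in $\mathcal{X}$ is typically very large:

$$|\mathcal{X}| = (n - 1)! \tag{2.42}$$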



This is exactly the setting of Section 2.4, so we can use the CE method to 
solve (2.41). Note however that we need to modify Algorithm 2.4.1 since we 
have here a minimization problem. 
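As a small illustration of (2.41) and (2.42), the sketch below computes tour lengths for a 4-node instance and enumerates all $(n-1)!$ tours that start at the first node. The cost matrix is hypothetical, nodes are labeled from 0 rather than 1 to match Python indexing, and the function name `tour_length` is an assumption.

```python
import itertools
import math
import numpy as np

def tour_length(C, x):
    """Length of the closed tour x[0] -> x[1] -> ... -> x[n-1] -> x[0], as in (2.41)."""
    n = len(x)
    return sum(C[x[i], x[(i + 1) % n]] for i in range(n))

# Hypothetical asymmetric cost matrix for n = 4 cities (0-based labels).
C = np.array([[0, 2, 9, 10],
              [1, 0, 6, 4],
              [15, 7, 0, 8],
              [6, 3, 12, 0]], dtype=float)

# All tours starting at node 0: (n - 1)! permutations of the remaining nodes.
tours = [(0,) + rest for rest in itertools.permutations(range(1, 4))]
best = min(tours, key=lambda x: tour_length(C, x))
```

There are $3! = 6$ tours; for this matrix the shortest one is $(0, 2, 3, 1)$ with length $9 + 8 + 3 + 1 = 21$. Enumeration of this kind is of course only feasible for very small $n$.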



^ In some versions of the TSP, cities can be visited more than once. 






In order to apply the CE algorithm we need to specify (a) how to generate 
the random tours, and (b) how to update the parameters at each iteration. 

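The easiest way to explain how the tours are generated and how the parameters
are updated is to relate (2.41) to an equivalent minimization problem.
Let

$$\tilde{\mathcal{X}} = \{(x_1, \ldots, x_n) : x_1 = 1,\ x_i \in \{1, \ldots, n\},\ i = 2, \ldots, n\} ,$$

be the set of vectors that correspond to paths of length $n$ that start in 1 and
can visit the same city more than once. Note that $|\tilde{\mathcal{X}}| = n^{n-1}$ and $\mathcal{X} \subset \tilde{\mathcal{X}}$.
When $n = 4$, we have for example $\mathbf{x} = (1, 3, 1, 3) \in \tilde{\mathcal{X}}$, corresponding to the
path $1 \to 3 \to 1 \to 3 \to 1$. Define the function $\tilde{S}$ on $\tilde{\mathcal{X}}$ by $\tilde{S}(\mathbf{x}) = S(\mathbf{x})$, if
$\mathbf{x} \in \mathcal{X}$ and $\tilde{S}(\mathbf{x}) = \infty$, otherwise. Then, obviously (2.41) is equivalent to the
minimization problem

$$\text{minimize } \tilde{S}(\mathbf{x}) \text{ over } \mathbf{x} \in \tilde{\mathcal{X}} . \tag{2.43}$$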

A simple method to generate a random path $\mathbf{X} = (X_1, \ldots, X_n)$ in $\tilde{\mathcal{X}}$ is to use
a Markov chain on the graph $G$, starting at node 1, and stopping after $n$ steps.
Let $P = (p_{ij})$ denote the one-step transition matrix of this Markov chain. We
assume that the diagonal elements of $P$ are 0, and that all other elements of
$P$ are strictly positive, but otherwise $P$ is a general $n \times n$ stochastic matrix.

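The pdf $f(\cdot; P)$ of $\mathbf{X}$ is thus parameterized by the matrix $P$ and its
logarithm is given by

$$\ln f(\mathbf{x}; P) = \sum_{r=1}^{n} \sum_{i,j} I_{\{\mathbf{x} \in \tilde{\mathcal{X}}_{ij}(r)\}} \ln p_{ij} ,$$

where $\tilde{\mathcal{X}}_{ij}(r)$ is the set of all paths in $\tilde{\mathcal{X}}$ for which the $r$-th transition is from
node $i$ to $j$. The updating rules for this modified optimization problem follow
from (2.25) ($W = 1$), with $\{S(\mathbf{X}_i) \ge \gamma\}$ replaced with $\{\tilde{S}(\mathbf{X}_i) \le \gamma\}$, under
the condition that the rows of $P$ sum up to 1. Using Lagrange multipliers
$u_1, \ldots, u_n$ we obtain the maximization problem

$$\max_{P} \min_{u_1, \ldots, u_n} \left[ E_P\, I_{\{\tilde{S}(\mathbf{X}) \le \gamma\}} \ln f(\mathbf{X}; P) + \sum_{i=1}^{n} u_i \left( \sum_{j=1}^{n} p_{ij} - 1 \right) \right] . \tag{2.44}$$

Differentiating the expression within square brackets above with respect to
$p_{ij}$ yields, for all $j = 1, \ldots, n$,

$$\frac{E_P\, I_{\{\tilde{S}(\mathbf{X}) \le \gamma\}} \sum_{r=1}^{n} I_{\{\mathbf{X} \in \tilde{\mathcal{X}}_{ij}(r)\}}}{p_{ij}} + u_i = 0 . \tag{2.45}$$

Summing over $j = 1, \ldots, n$ gives $E_P\, I_{\{\tilde{S}(\mathbf{X}) \le \gamma\}} \sum_{r=1}^{n} I_{\{\mathbf{X} \in \tilde{\mathcal{X}}_{i}(r)\}} = -u_i$, where
$\tilde{\mathcal{X}}_{i}(r)$ is the set of paths for which the $r$-th transition starts from node $i$. It
follows that the optimal $p_{ij}$ is given by

$$p_{ij} = \frac{E_P\, I_{\{\tilde{S}(\mathbf{X}) \le \gamma\}} \sum_{r=1}^{n} I_{\{\mathbf{X} \in \tilde{\mathcal{X}}_{ij}(r)\}}}{E_P\, I_{\{\tilde{S}(\mathbf{X}) \le \gamma\}} \sum_{r=1}^{n} I_{\{\mathbf{X} \in \tilde{\mathcal{X}}_{i}(r)\}}} . \tag{2.46}$$

The corresponding estimator is

$$\hat{p}_{ij} = \frac{\sum_{k=1}^{N} I_{\{\tilde{S}(\mathbf{X}_k) \le \gamma\}} \sum_{r=1}^{n} I_{\{\mathbf{X}_k \in \tilde{\mathcal{X}}_{ij}(r)\}}}{\sum_{k=1}^{N} I_{\{\tilde{S}(\mathbf{X}_k) \le \gamma\}} \sum_{r=1}^{n} I_{\{\mathbf{X}_k \in \tilde{\mathcal{X}}_{i}(r)\}}} . \tag{2.47}$$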



This has a very simple interpretation. To update $p_{ij}$ we simply take the
fraction of times that the transitions from $i$ to $j$ occurred, taking into account
only those paths that have a total length less than or equal to $\gamma$.

This is how we could, in principle, carry out the sample generation and
parameter updating for problem (2.43). We generate the path via a Markov
process with transition matrix $P$, and use updating formula (2.47). However,
in practice, we would never generate the paths in this way, since the majority
of these paths would be irrelevant: they would not constitute a tour,
and therefore their $\tilde{S}$ values would be $\infty$. In order to avoid the generation of
irrelevant paths, we proceed as follows.

Algorithm 2.5.1 (Generation of permutations (tours) in the TSP)

1. Define $P^{(1)} = P$ and $X_1 = 1$. Let $k = 1$.

2. Obtain $P^{(k+1)}$ from $P^{(k)}$ by first setting the $X_k$-th column of $P^{(k)}$ to 0
and then normalizing the rows to sum up to 1. Generate $X_{k+1}$ from the
distribution formed by the $X_k$-th row of $P^{(k+1)}$.

3. If $k = n - 1$ then stop; otherwise set $k = k + 1$ and reiterate from Step 2.
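Algorithm 2.5.1 can be sketched as follows. This is an illustrative sketch, not the authors' code: nodes are labeled from 0, the function name `generate_tour` is hypothetical, and only the row that is actually sampled from is renormalized (which yields the same distribution as renormalizing every row). It also assumes that all off-diagonal entries of $P$ are strictly positive, so that the row being normalized never sums to zero.

```python
import numpy as np

def generate_tour(P, rng):
    """Draw one tour (0-based labels, starting at node 0) following Algorithm 2.5.1."""
    n = P.shape[0]
    Pk = np.array(P, dtype=float)        # working copy, plays the role of P^(k)
    tour = [0]                           # X_1 = 1 in the book's 1-based labels
    for _ in range(n - 1):
        current = tour[-1]
        Pk[:, current] = 0.0             # the current node may not be visited again
        row = Pk[current]
        probs = row / row.sum()          # normalize the row we sample from
        tour.append(int(rng.choice(n, p=probs)))
    return tour

n = 6
P0 = (1.0 - np.eye(n)) / (n - 1)         # uniform off-diagonal initial matrix
rng = np.random.default_rng(0)
tour = generate_tour(P0, rng)
```

Every generated vector is a permutation of the nodes starting at node 0, so no samples are wasted on paths that are not tours.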

It is important to realize that the updating formula remains the same; by
using Algorithm 2.5.1 we are merely speeding up our naive way of generating
the paths. Moreover, since we now only generate tours, the updated value for
$p_{ij}$ can be estimated as

$$\hat{p}_{ij} = \frac{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \le \gamma\}}\, I_{\{\mathbf{X}_k \in \mathcal{X}_{ij}\}}}{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \le \gamma\}}} , \tag{2.48}$$

where $\mathcal{X}_{ij}$ is the set of tours in which the transition from $i$ to $j$ is made. This
has the same “natural” interpretation as discussed for (2.47).
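The estimator (2.48) is a simple counting rule, sketched below on three hand-checkable tours. The function name `tsp_update`, the 0-based node labels and the example tour lengths are assumptions; the closing transition back to the start node is counted as part of each tour, and the code assumes at least one tour has length at most `gamma`.

```python
import numpy as np

def tsp_update(tours, lengths, gamma, n):
    """Formula (2.48): fraction of elite tours that contain the transition i -> j."""
    counts = np.zeros((n, n))
    n_elite = 0
    for x, s in zip(tours, lengths):
        if s <= gamma:                              # elite: length at most gamma
            n_elite += 1
            x = list(x)
            for i, j in zip(x, x[1:] + x[:1]):      # includes the edge back to the start
                counts[i, j] += 1
    return counts / n_elite

tours = [[0, 1, 2], [0, 2, 1], [0, 1, 2]]
lengths = [5.0, 7.0, 5.0]                           # hypothetical tour lengths
P_all = tsp_update(tours, lengths, gamma=7.0, n=3)  # all three tours are elite
P_best = tsp_update(tours, lengths, gamma=5.0, n=3) # only the two shortest are elite
```

Since each elite tour leaves every node exactly once, the rows of the updated matrix sum to 1 automatically. In practice this raw update would typically be combined with the smoothing of (2.33).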







To complete the algorithm, we need to specify the initialization conditions
and the stopping criterion. For the initial matrix $\hat{P}_0$ we could simply take all
off-diagonal elements equal to $1/(n - 1)$ and for the stopping criterion use
formula (2.32).

Numerical Examples 

To demonstrate the usefulness of the CE algorithm and its fast and accurate
convergence we provide a number of numerical examples. The first example
concerns the benchmark TSP ft53 taken from the URL

http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/atsp/

Table 2.4 presents a typical evolution of the CE Algorithm for the problem
ft53, which defines an asymmetric fully connected graph of size 53, where the
cost of each edge $c_{ij}$ is given. The CE parameters were: stopping parameter
$d = 5$, rarity parameter $\varrho = 0.01$, sample size $N = 10 n^2 = 28090$, and
smoothing parameter $\alpha = 0.7$. The relative experimental error of the solution
is

$$\varepsilon = \frac{\hat\gamma_T - \gamma^*}{\gamma^*} = 0.015 , \tag{2.49}$$

where $\gamma^* = 6905$ is the best known solution. The CPU time was approximately
6 minutes. In Table 2.4, $S_{t,(1)}$ denotes the length of the smallest tour in iteration
$t$. We also included the quantity $P_t^{mm}$, which is the minimum of the maximum
elements in each row of matrix $\hat{P}_t$.
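As a quick arithmetic check of the reported relative error, assume (this identification is an assumption, not stated explicitly in the text) that the final solution equals the shortest tour length 7008 appearing in the last rows of Table 2.4. Then, with the best known value 6905,

```latex
\varepsilon \;=\; \frac{7008 - 6905}{6905} \;=\; \frac{103}{6905} \;\approx\; 0.0149 \;\approx\; 0.015 .
```

This agrees with the value quoted in (2.49) to the stated precision.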



Table 2.4. A typical evolution of Algorithm 2.4.1 for the TSP ft53 with $n = 53$
nodes, $d = 5$, $\varrho = 0.01$, $\alpha = 0.7$, $N = 10 n^2 = 28090$.

  t    $\hat\gamma_t$    $S_{t,(1)}$    $P_t^{mm}$   |    t    $\hat\gamma_t$    $S_{t,(1)}$    $P_t^{mm}$
  1    23234.00    21111.00    0.0354   |   17     9422.00     8614.00    0.1582
  2    20611.00    18586.00    0.0409   |   18     9155.00     8528.00    0.1666
  3    18686.00    16819.00    0.0514   |   19     8661.00     7970.00    0.1352
  4    17101.00    14890.00    0.0465   |   20     8273.00     7619.00    0.1597
  5    15509.00    13459.00    0.0698   |   21     8096.00     7485.00    0.1573
  6    14449.00    12756.00    0.0901   |   22     7868.00     7216.00    0.1859
  7    13491.00    11963.00    0.0895   |   23     7677.00     7184.00    0.2301
  8    12773.00    11326.00    0.1065   |   24     7519.00     7108.00    0.2421
  9    12120.00    10357.00    0.0965   |   25     7420.00     7163.00    0.2861
 10    11480.00    10216.00    0.1034   |   26     7535.00     7064.00    0.3341
 11    11347.00     9952.00    0.1310   |   27     7506.00     7072.00    0.3286
 12    10791.00     9525.00    0.1319   |   28     7199.00     7008.00    0.3667
 13    10293.00     9246.00    0.1623   |   29     7189.00     7024.00    0.3487
 14    10688.00     9176.00    0.1507   |   30     7077.00     7008.00    0.4101
 15     9727.00     8457.00    0.1346   |   31     7068.00     7008.00    0.4680
 16     9263.00     8424.00    0.1436   |













Similar performances were found for other TSPs in the benchmark library
above. Table 2.5 presents the performance of Algorithm 2.4.1 for a selection
of case studies from this library. In all numerical results we use the same CE
parameters as for the ft53 problem, that is $\varrho = 10^{-2}$, $N = 10 n^2$, $\alpha = 0.7$
(smoothing parameter in (2.33)) and $d = 5$ (in (2.32)). To study the variability
in the solutions, each problem was repeated 10 times. In Table 2.5, $n$ denotes
the number of nodes of the graph, $T$ denotes the average total number of
iterations needed before stopping, $\hat\gamma_1$ and $\hat\gamma_T$ denote the average initial and
final estimates of the optimal solution, $\gamma^*$ denotes the best known solution,
$\varepsilon$ denotes the average relative experimental error based on 10 replications,
$\varepsilon_*$ and $\varepsilon^*$ denote the smallest and the largest relative error among the 10
generated shortest paths, and finally CPU denotes the average CPU time in
seconds. We found that when the sample size is decreased from $N = 10 n^2$ to
$N = 5 n^2$, all relative experimental errors $\varepsilon$ in Table 2.5 increase by at most a
factor of 1.5.



Table 2.5. Case studies for the TSP.

 file     n     T     $\hat\gamma_1$     $\hat\gamma_T$     $\gamma^*$     $\varepsilon$     $\varepsilon_*$     $\varepsilon^*$     CPU
 br17    17   23.8       68.2       39.0       39    0.000    0.000    0.000       9
 ftv33   34   31.2     3294.0     1312.2     1286    0.020    0.000    0.062      73
 ftv35   36   31.5     3714.0     1490.0     1473    0.012    0.004    0.018      77
 ftv38   39   33.8     4010.8     1549.8     1530    0.013    0.004    0.032     132
 p43     43   44.5     9235.5     5624.5     5620    0.010    0.000    0.001     378
 ftv44   45   35.5     4808.2     1655.8     1613    0.027    0.013    0.033     219
 ftv47   48   40.2     5317.8     1814.0     1776    0.021    0.006    0.041     317
 ry48p   48   40.8    40192.0    14845.5    14422    0.029    0.019    0.050     345
 ft53    53   39.5    20889.5     7103.2     6905    0.029    0.025    0.035     373
 ftv55   56   40.0     5835.8     1640.0     1608    0.020    0.002    0.043     408
 ftv64   65   43.2     6974.2     1850.0     1839    0.006    0.000    0.014     854
 ftv70   71   47.0     7856.8     1974.8     1950    0.013    0.004    0.037    1068
 ft70    70   42.8    64199.5    39114.8    38673    0.011    0.003    0.019     948

Dynamics 

Finally, as an illustration of the dynamics of the CE algorithm, we display
below the sequence of matrices $\hat{P}_0, \hat{P}_1, \ldots$ for a TSP with $n = 10$ cities, where the
optimal tour is $(1, 2, 3, \ldots, 10, 1)$. A graphical illustration of the convergence
is given in Figure 2.6, where we omitted $\hat{P}_0$ whose off-diagonal elements are
all equal to 1/9 and diagonal elements equal to 0.







$\hat{P}_1$ =

    0.00 0.31 0.04 0.08 0.04 0.19 0.08 0.08 0.12 0.08
    0.04 0.00 0.33 0.08 0.17 0.08 0.08 0.04 0.04 0.12
    0.08 0.08 0.00 0.23 0.04 0.04 0.12 0.19 0.08 0.15
    0.12 0.19 0.08 0.00 0.12 0.08 0.08 0.08 0.19 0.08
    0.08 0.08 0.19 0.08 0.00 0.23 0.08 0.04 0.15 0.08
    0.04 0.04 0.08 0.04 0.12 0.00 0.50 0.08 0.08 0.04
    0.23 0.08 0.08 0.04 0.08 0.04 0.00 0.27 0.08 0.12
    0.08 0.15 0.04 0.04 0.19 0.08 0.08 0.00 0.27 0.08
    0.08 0.08 0.04 0.12 0.08 0.15 0.08 0.04 0.00 0.35
    0.21 0.08 0.17 0.08 0.04 0.12 0.08 0.12 0.08 0.00

$\hat{P}_2$ =

    0.00 0.64 0.03 0.06 0.04 0.04 0.06 0.04 0.04 0.06
    0.03 0.00 0.58 0.07 0.07 0.05 0.05 0.03 0.03 0.08
    0.05 0.05 0.00 0.52 0.04 0.03 0.08 0.04 0.05 0.15
    0.04 0.13 0.05 0.00 0.22 0.18 0.05 0.04 0.25 0.05
    0.06 0.04 0.09 0.04 0.00 0.60 0.04 0.03 0.04 0.06
    0.03 0.03 0.05 0.03 0.04 0.00 0.71 0.05 0.05 0.03
    0.20 0.04 0.05 0.03 0.05 0.03 0.00 0.51 0.05 0.04
    0.05 0.08 0.03 0.04 0.23 0.05 0.05 0.00 0.42 0.05
    0.05 0.05 0.04 0.07 0.07 0.10 0.05 0.03 0.00 0.54
    0.50 0.05 0.04 0.05 0.04 0.08 0.05 0.14 0.05 0.00

$\hat{P}_3$ =

    0.00 0.76 0.02 0.04 0.03 0.03 0.04 0.03 0.03 0.04
    0.02 0.00 0.73 0.05 0.05 0.04 0.04 0.02 0.02 0.05
    0.03 0.03 0.00 0.70 0.02 0.02 0.05 0.02 0.03 0.09
    0.02 0.07 0.03 0.00 0.59 0.10 0.03 0.02 0.13 0.03
    0.04 0.03 0.06 0.03 0.00 0.73 0.03 0.02 0.03 0.04
    0.02 0.02 0.04 0.02 0.03 0.00 0.79 0.04 0.04 0.02
    0.12 0.02 0.03 0.02 0.03 0.02 0.00 0.69 0.03 0.02
    0.03 0.05 0.02 0.02 0.14 0.03 0.03 0.00 0.66 0.03
    0.03 0.03 0.02 0.05 0.05 0.06 0.03 0.02 0.00 0.71
    0.69 0.03 0.02 0.03 0.02 0.05 0.03 0.09 0.03 0.00

$\hat{P}_4$ =

    0.00 0.82 0.01 0.03 0.02 0.02 0.03 0.02 0.02 0.03
    0.01 0.00 0.80 0.03 0.03 0.03 0.03 0.01 0.01 0.04
    0.02 0.02 0.00 0.79 0.02 0.01 0.03 0.02 0.02 0.07
    0.01 0.04 0.02 0.00 0.73 0.06 0.02 0.01 0.09 0.02
    0.03 0.02 0.04 0.02 0.00 0.81 0.02 0.01 0.02 0.03
    0.01 0.01 0.03 0.01 0.02 0.00 0.84 0.03 0.03 0.01
    0.09 0.02 0.02 0.01 0.02 0.01 0.00 0.78 0.02 0.02
    0.02 0.03 0.01 0.02 0.09 0.02 0.02 0.00 0.76 0.02
    0.02 0.02 0.02 0.03 0.03 0.05 0.02 0.01 0.00 0.79
    0.78 0.02 0.02 0.02 0.02 0.03 0.02 0.06 0.02 0.00

$\hat{P}_5$ =

    0.00 0.86 0.01 0.02 0.02 0.02 0.02 0.02 0.02 0.02
    0.01 0.00 0.85 0.03 0.03 0.02 0.02 0.01 0.01 0.03
    0.02 0.02 0.00 0.84 0.01 0.01 0.03 0.01 0.02 0.05
    0.01 0.03 0.01 0.00 0.80 0.05 0.01 0.01 0.06 0.01
    0.02 0.02 0.03 0.02 0.00 0.85 0.02 0.01 0.02 0.02
    0.01 0.01 0.02 0.01 0.02 0.00 0.88 0.02 0.02 0.01
    0.06 0.01 0.02 0.01 0.02 0.01 0.00 0.84 0.02 0.01
    0.02 0.02 0.01 0.01 0.07 0.02 0.02 0.00 0.82 0.02
    0.02 0.02 0.01 0.02 0.02 0.03 0.02 0.01 0.00 0.84
    0.84 0.02 0.01 0.02 0.01 0.03 0.02 0.05 0.02 0.00








Fig. 2.6. Convergence of the reference parameter (matrix) for a 10 node TSP. 



2.6 Exercises 

1. Implement and repeat the rare-event simulation toy example correspond- 
ing to Table 2.1. 

2. Implement and repeat the combinatorial optimization toy example corre- 
sponding to Table 2.2. 

3. Extend the program used in Exercise 2 to include smoothed updating, and 
apply the program to a larger example, say n = 50. Observe if and how 
the choice of parameters affects the accuracy and speed of the algorithm. 





4. Consider the one-phase analog of the two-phase CE program in
Exercise 3; see Remark 2.7. Use the reward function $\varphi(s) = s$.

a) Show that the updating formulas for the $p_j$'s become:

$$\hat{p}_j = \frac{\sum_{i=1}^{N} S(\mathbf{X}_i)\, X_{ij}}{\sum_{i=1}^{N} S(\mathbf{X}_i)} .$$

b) Compare numerically the performances of the one- and two-phase
algorithms.

5. Verify (2.40) and show that $c > b(n - m)/m$ is a sufficient condition for $V^*$
in (2.39) to be the optimal cut.

6. In the famous n-queens problem, introduced by Gauss in 1850, the
objective is to position $n$ queens on an $n \times n$ chess board such that no one
queen can be taken by any other. Write a computer program that solves
the 8-queens problem using the CE method. We may assume that the
queens occupy different rows. The positions of the queens may thus be
represented as a vector $\mathbf{x} \in \{1, \ldots, 8\}^8$, where $x_i$ indicates the column
that the queen of row $i$ occupies. For example, the configuration of
Figure 2.7 corresponds to $\mathbf{x} = (2, 3, 7, 4, 8, 5, 1, 6)$. A straightforward way
to generate random vectors $\mathbf{X}$ is to draw each $X_i$ independently from a
probability vector $(p_{i1}, \ldots, p_{i8})$, $i = 1, \ldots, 8$. The performance function $S$
could be chosen such that it represents the number of times the queens can
attack each other, that is, the sum, over each row, column and diagonal
containing at least one queen, of the number of queens minus one. In
Figure 2.7, $S(\mathbf{x}) = 1$. The updating
formulas for the $p_{ij}$ are easy. Excluding symmetry, there are 12 different
solutions to the problem. Find them all, by running your algorithm several
times. Take $N = 500$, $\alpha = 0.7$ and $\varrho = 0.1$.




























Fig. 2.7. Position the 8 queens such that no queen can attack another. 





3 



Efficient Simulation via Cross-Entropy 



3.1 Introduction 

The performance of modern systems, such as coherent reliability systems, 
inventory systems, insurance risk, storage systems, computer networks and 
telecommunication networks is often characterized by probabilities of rare 
events and is frequently studied through simulation. Analytical solutions or 
accurate approximations for such rare-event probabilities are only available 
for a very restricted class of systems. Consequently, one often has to resort to 
simulation. However, estimation of rare-event probabilities with crude Monte
Carlo techniques requires a prohibitively large number of trials, as illustrated
in Example 1.3. Thus, new techniques are required for this type of problem.
Two methods, called splitting/RESTART and importance sampling (IS), have
been extensively investigated by the simulation community in the last decade.

The basic idea of splitting proposed by Kahn and Harris [93] is to partition 
the state space of the system into a series of nested subsets and to consider the 
rare event as the intersection of a nested sequence of events. When a given 
subset is entered by a sample trajectory during the simulation, numerous 
random retrials are generated with the initial state for each retrial being the 
state of the system at the entry point. By doing so, the system trajectory 
is split into a number of new subtrajectories, hence the name “splitting.” A 
similar idea has been developed by Villén-Altamirano and Villén-Altamirano
[166, 167] into a refined simulation technique under the name RESTART, which
has been extended by different authors [58, 59, 63, 64, 65, 72, 71, 79, 151, 152] 
to the multiple threshold case. 

In this chapter, however, we focus on importance sampling techniques. The 
main idea of IS [68, 160] is to make the occurrence of rare events more frequent 
by carrying out the simulation under a different probability distribution — 
the so-called change of measure (CoM) — and to estimate the probability 
of interest via a corresponding likelihood ratio (LR) estimator. The aim is 
to select a CoM that minimizes the variance of the LR estimator. It is well 
known that, in theory, there exists a CoM that yields a zero-variance LR
estimator. However, in practice such an optimal CoM cannot be computed
since it depends on the underlying quantity/quantities being estimated.

Prominent among the CoMs is the exponential change of measure (ECM).
Recall, see (1.6), that here instead of the original pdf $f(x)$, the simulation
is carried out under an “exponentially twisted” pdf $f_\theta(x) = c\, e^{\theta x} f(x)$, where
$\theta$ is the twisting parameter. ECM often yields efficient IS estimates; see for
example Sadowsky [150] and Asmussen and Rubinstein [18], but is usually
feasible only for relatively simple models; see also [82, 101, 154]. For such models
the (asymptotically) optimal twisting parameter $\theta^*$ is often derived via the theory
of large deviations [32].

An alternative approach to ECM is to use an IS pdf, say $f(x; v)$, which
belongs to the same parametric family as the original distribution (also called
the nominal distribution), say $f(x; u)$. We shall call such an approach the
standard likelihood ratio (SLR) approach. Similar to ECM, the SLR approach
typically does not lead to the optimal zero-variance estimator, but yields sig-
nificant variance reduction; see for instance [148] and below. The advantage
of such an approach is that (a) it can be applied to rather general models,
and (b) the optimal reference parameter $v^*$ of the IS density $f(x; v)$ can be
derived with standard optimization techniques.

In this chapter, which is partly based on [85] and [103], we give a detailed 
description of the CE method, and explain how it can be used efficiently in 
importance sampling simulation, especially with regard to rare events. The 
CE method was proposed in [144] as an adaptive IS algorithm for rare-event 
simulation, in which the reference parameter v* is estimated by minimizing the 
sample variance of the SLR estimator. The proposed algorithm is called the 
variance minimization (VM) algorithm. In [145] this IS algorithm was further 
modified to minimize, instead of the sample variance, the sample Kullback- 
Leibler distance, or cross-entropy (CE) distance, between the theoretical zero- 
variance change of measure and the importance sampling distribution. The 
estimation method thus obtained is called the simulated cross-entropy or just
the cross-entropy (CE) method. One of the most useful aspects of the CE 
method is that the optimal reference parameter for an SLR estimator can be 
effectively estimated without requiring a detailed pre-analysis of the model 
(for example involving large deviations). We will mostly limit the exposition 
to static simulation models, that is, models in which time does not play a 
role. Many problems in operations research and manufacturing fall into that 
category. Various problems in finance — an area that has received much recent 
attention from the simulation community — also have a static nature [31]. 
However, an application of the CE method to a dynamic model, a queueing 
model, will be given in Section 3.10. As mentioned, an attractive feature of 
the CE method is that it can be readily modified for efficient estimation of 
the optimal solution of NP-hard combinatorial optimization problems and for 
machine learning, which is the subject of the next chapters. 

We show that the CE method is readily applicable to both light- and heavy- 
tailed distributions; see Section 1.2. Because by definition the exponential
moments do not exist for heavy-tailed distributions, the exponential change
of measure is intrinsically impossible for heavy-tailed distributions when a 
positive twisting parameter is required. So an alternative method must be 
used. In their landmark paper, Asmussen, Binswanger, and Højgaard [14]
consider various estimators for rare events of the form $\{S_N > x\}$, where
$S_N$ is the random or deterministic sum of i.i.d. positive random variables
with subexponential pdf, $f(x)$ say. Two logarithmically efficient estimators are
given. The first one, based on Asmussen and Binswanger [13], uses conditional
Monte Carlo [148] in combination with order statistics. The second estimator
uses importance sampling, where the IS density, $g(x)$ say, consists of two
parts: for small values of x, $g(x)$ is proportional to $f(x)$ and for large values
of x, $g(x)$ is much larger than $f(x)$, decreasing slightly faster than $1/x$. Juneja
and Shahabuddin [91] consider a similar problem to the one in [14] and their
approach is to estimate the probability of $\{S_N > x\}$ via IS using a density $h(x)$ which is obtained
from the original $f(x)$ by “twisting” the hazard rate. Several variations of this
idea are considered. Note that all the above heavy-tail methods have limited
application since they deal basically only with the estimation of probabilities
of the above events $\{S_N > x\}$.

We address the selection of a proper class of IS distributions for both 
light- and heavy-tailed distributions via the transform likelihood ratio (TLR) 
method [103]. The idea is to transform the random variables and to apply a 
change of measure to the distribution of the transformed random variables. 
This simple “change of variable” technique allows us to transform an original 
rare-event probability with heavy-tail distributions to an equivalent (auxil- 
iary) one with an arbitrary tail distribution, such as the uniform or exponen- 
tial distribution, and then we apply a change of measure to the new (aux- 
iliary) distribution. We typically transform to light-tailed distributions, and 
then apply the ECM or the SLR method to obtain a convenient class of IS 
distributions. Recall that in the latter case, the IS distribution belongs to the 
same parametric family as the original auxiliary one. As mentioned before 
we shall use the CE method to estimate the optimal parameter vector of the 
(parametric) IS distribution. 

The remainder of this chapter is organized as follows: In Section 3.2 we 
describe the main ideas behind importance sampling and the SLR approach. 
In Section 3.3 the CE method is discussed. In Section 3.4 we present an 
adaptive CE algorithm for the estimation of the optimal parameters in the 
importance sampling density for rare-event simulation. The procedure, which 
involves a sequence of stochastic optimization subproblems, can be applied 
quite generally. It works particularly well if the underlying distributions have 
finite support or if they belong to a natural exponential family, since in those 
cases there are analytical solutions to those optimization subproblems. We 
provide various examples in Section 3.5. Convergence issues of the CE method 
are illustrated and discussed in Section 3.6. In Section 3.7 we show how CE 
can be used for finding the root of certain nonlinear equations. Section 3.8 
deals with the TLR method, and its application to heavy-tail distributions. In
Section 3.10 we show how CE can be applied to efficiently estimate the tail
probabilities of waiting time in a queueing system. Section 3.11 discusses the 
“big-step” modifications of the CE method. Various numerical examples are 
given in Section 3.12. A direct proof of polynomial complexity pertaining to 
Weibull random variables with heavy tails is given in the appendix of this 
chapter. 



3.2 Importance Sampling 

Here we review some background material from [148]. Let $\ell$ be the expected
performance of a stochastic system given in the form

$$\ell = E_f H(X) = E_f\, \varphi(S(X); \gamma) = \int \varphi(S(x); \gamma)\, f(x)\, \mu(dx) . \tag{3.1}$$

S here is the sample performance function, $\varphi(\cdot\,; \gamma)$ is a real-valued function of
the sample performance, which depends on a parameter $\gamma$, and f is the density
of X with respect to some measure $\mu$; see Section 1.2. The subscript f in
$E_f H(X)$ means that the expectation is taken with respect to the density f.
Examples of $\varphi(S(X); \gamma)$ are indicator functions

$$\varphi(S(X); \gamma) = I_{\{S(X) \geq \gamma\}} \tag{3.2}$$

and Boltzmann functions

$$\varphi(S(X); \gamma) = \exp(-S(X)/\gamma) . \tag{3.3}$$

An example of the above framework is the stochastic shortest path problem
illustrated in Section 2.2.1. The shortest path in a stochastic network can be
defined as

$$S(X) = \min_{j=1,\ldots,p} \sum_{l \in B_j} X_l . \tag{3.4}$$

Here, $B_j$ is the j-th complete path from a source to a sink, p is the number of
complete paths, and $X = (X_1, \ldots, X_n)$ is the random vector whose compo-
nents $X_i$, $i = 1, \ldots, n$, represent the durations (weights) of the links. For the
longest path we replace min with max in (3.4).

Let g be another probability density such that $Hf$ is dominated by g. That
is, $g(x) = 0 \Rightarrow H(x) f(x) = 0$. Using the density g we can represent $\ell$ as

$$\ell = \int H(x)\, \frac{f(x)}{g(x)}\, g(x)\, \mu(dx) = E_g H(X)\, \frac{f(X)}{g(X)} , \tag{3.5}$$

where the subscript g means that the expectation is taken with respect to g,
which is called the importance sampling (IS) density.







An unbiased estimator of $\ell$ is

$$\hat{\ell} = \frac{1}{N} \sum_{i=1}^{N} H(X_i)\, W(X_i) , \tag{3.6}$$

where $\hat{\ell}$ is called the importance sampling (IS) or the likelihood ratio (LR)
estimator,

$$W(x) = f(x)/g(x) \tag{3.7}$$

is called the likelihood ratio (LR), and $X_1, \ldots, X_N$ is a random sample from g,
that is, $X_1, \ldots, X_N$ are i.i.d. random vectors with density g. In the particular
case where there is no “change of measure,” that is, $g = f$, we have $W = 1$,
and the LR estimator in (3.6) reduces to the following crude Monte Carlo
(CMC) estimator:

$$\hat{\ell} = \frac{1}{N} \sum_{i=1}^{N} H(X_i) , \tag{3.8}$$

where $X_1, \ldots, X_N$ is a random sample from the density f.
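
To make the difference between (3.6) and (3.8) concrete, the following Python sketch compares the two estimators on a toy problem that is not taken from the text: X is exponential with mean u, H(X) is the indicator of {X ≥ γ}, and the IS density g is exponential with mean v. The function names and the choice v = γ are assumptions made only for illustration.

```python
import math
import random

def cmc_estimate(u, gamma, n, rng):
    """Crude Monte Carlo (3.8): fraction of X_i ~ Exp(mean u) with X_i >= gamma."""
    hits = sum(1 for _ in range(n) if rng.expovariate(1.0 / u) >= gamma)
    return hits / n

def lr_estimate(u, v, gamma, n, rng):
    """LR estimator (3.6) with IS density g = Exp(mean v) and W = f/g as in (3.7)."""
    total = 0.0
    for _ in range(n):
        x = rng.expovariate(1.0 / v)      # X_i drawn from g
        if x >= gamma:                    # H(X_i) = 1 on the rare event
            f = math.exp(-x / u) / u      # nominal density f(x)
            g = math.exp(-x / v) / v      # importance sampling density g(x)
            total += f / g                # H(X_i) W(X_i)
    return total / n

rng = random.Random(1)
u, gamma, n = 1.0, 12.0, 10_000
exact = math.exp(-gamma / u)               # closed form for this toy case
cmc = cmc_estimate(u, gamma, n, rng)
lr = lr_estimate(u, gamma, gamma, n, rng)  # illustrative choice v = gamma
print(exact, cmc, lr)
```

With a probability of about 6e-6 and only 10,000 trials, the CMC estimate is typically exactly zero, whereas the LR estimate is usually within a few percent of the exact value.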

The choice of the IS density g is crucial for the variance of the estimator
$\hat{\ell}$ in (3.6), and we consider next the problem of minimizing the variance of $\hat{\ell}$
with respect to g, that is

$$\min_g \mathrm{Var}_g\!\left( H(X)\, \frac{f(X)}{g(X)} \right) . \tag{3.9}$$

It is well known (see for example [148]) that the solution of the problem (3.9)
is

$$g^*(x) = \frac{|H(x)|\, f(x)}{\int |H(x)|\, f(x)\, \mu(dx)} . \tag{3.10}$$

Note that if $H(x) \geq 0$, then

$$g^*(x) = \frac{H(x)\, f(x)}{\ell} \tag{3.11}$$

and

$$\mathrm{Var}_{g^*}(\hat{\ell}) = \mathrm{Var}_{g^*}(H(X)\, W(X)) = \mathrm{Var}_{g^*}(\ell) = 0 .$$

The density $g^*$ as per (3.10) and (3.11) is called the optimal importance sam-
pling density.

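Example 3.1. Let $X \sim \mathrm{Exp}(u^{-1})$, and $H(X) = I_{\{X \geq \gamma\}}$ for some $\gamma > 0$. Let f
denote the pdf of X. Consider the estimation of

$$\ell = E_u H(X) = \int_\gamma^\infty u^{-1} e^{-x u^{-1}}\, dx = e^{-\gamma u^{-1}} .$$

We have, from (3.11),

$$g^*(x) = H(x)\, f(x)\, \ell^{-1} = I_{\{x \geq \gamma\}}\, u^{-1} e^{-x u^{-1}} e^{\gamma u^{-1}} = I_{\{x \geq \gamma\}}\, u^{-1} e^{-(x - \gamma) u^{-1}} .$$

Thus, the optimal importance sampling density of X is the shifted exponential
distribution. Note that $Hf$ is dominated by $g^*$, but f itself is not dominated
by $g^*$. Since $g^*$ is optimal, the LR estimator $\hat{\ell}$ is constant, that is

$$\hat{\ell} = H(X)\, W(X) = \frac{H(X)\, f(X)}{H(X)\, f(X)/\ell} = \ell .$$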
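
The zero-variance property can be checked numerically. The sketch below (function name and parameter values are illustrative assumptions) samples from the shifted exponential density g* and evaluates each term H(X)W(X); every term should equal exp(-γ/u) up to floating-point rounding.

```python
import math
import random

def lr_terms_under_optimal_density(u, gamma, n, rng):
    """Return the n terms H(X_i) W(X_i) when X_i is drawn from g*."""
    terms = []
    for _ in range(n):
        x = gamma + rng.expovariate(1.0 / u)         # shifted exponential, so H(x) = 1
        f = math.exp(-x / u) / u                     # nominal density f(x)
        g_star = math.exp(-(x - gamma) / u) / u      # optimal IS density g*(x)
        terms.append(f / g_star)
    return terms

terms = lr_terms_under_optimal_density(2.0, 10.0, 1000, random.Random(0))
print(min(terms), max(terms), math.exp(-10.0 / 2.0))
```

Of course, writing down g* here required knowing ℓ = exp(-γ/u) in advance, which is exactly the practical obstacle to using the optimal density.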



In general, implementation of the optimal importance sampling density $g^*$
as per (3.10) and (3.11) is problematic. The main difficulty lies in the fact that
to derive $g^*(x)$ one needs to know $\ell$. But $\ell$ is precisely the quantity we want
to estimate from the simulation! In most simulation studies the situation is
even worse, since the analytical expression for the sample performance H is
unknown in advance. To overcome this difficulty one can make a pilot run with
the underlying model, obtain a sample $H(X_1), \ldots, H(X_N)$, and then use it
to estimate $g^*$. It is important to note that sampling from such an artificially
constructed density might be a very complicated and time-consuming task,
especially when g is a high-dimensional density.

Since sampling from the optimal importance sampling density $g^*$ is prob-
lematic we consider next the case when the underlying density f belongs to
some parametric family $\mathcal{F} = \{f(\cdot; v),\ v \in \mathcal{V}\}$. From now on, we will assume
that this is the case. Let $f(\cdot; u)$ denote the density of the random vector X in
(3.1), for some fixed “nominal” parameter $u \in \mathcal{V}$. We now restrict the choice
of IS densities g to those from the same parametric family $\mathcal{F}$; so g differs from
the original density $f(\cdot; u)$ by a single parameter (vector) v, which we will
call the reference parameter. We will write the likelihood ratio in (3.7), with
$g(x) = f(x; v)$, as

$$W(x; u, v) = \frac{f(x; u)}{f(x; v)} . \tag{3.12}$$

In this case the LR estimator $\hat{\ell}$ in (3.6) becomes

$$\hat{\ell} = \frac{1}{N} \sum_{i=1}^{N} H(X_i)\, W(X_i; u, v) , \tag{3.13}$$

where $X_1, \ldots, X_N$ is a random sample from $f(\cdot; v)$. We will call (3.13) the
standard likelihood ratio (SLR) estimator, in contrast to the (nonparametric)
LR estimator (2.11). To find an optimal v in the SLR estimator $\hat{\ell}$ we need to
consider the program (3.9), which reduces to

$$\min_v \mathrm{Var}_v\big( H(X)\, W(X; u, v) \big) . \tag{3.14}$$

Since under $f(\cdot; v)$ the expectation $\ell = E_v H(X)\, W(X; u, v)$ is constant, the
optimal solution of (3.14) coincides with that of




$$\min_v V(v) = \min_v E_v H^2(X)\, W^2(X; u, v) . \tag{3.15}$$

The above optimization problem can still be difficult to solve, since the density
with respect to which the expectation is computed depends on the decision
variable v. To overcome this obstacle, we rewrite (3.15) as

$$\min_v V(v) = \min_v E_w H^2(X)\, W(X; u, v)\, W(X; u, w) . \tag{3.16}$$

Note that (3.16) is obtained from (3.15) by multiplying and dividing the in-
tegrand by $f(x; w)$, where w is an arbitrary reference parameter. Note also
that in (3.15) and (3.16) the expectation is taken with respect to the densities
$f(\cdot; v)$ and $f(\cdot; w)$, respectively. Moreover, $W(X; u, w) = f(X; u)/f(X; w)$,
and $X \sim f(x; w)$. Note finally that for the particular case w = u we obtain
from (3.16)

$$\min_v V(v) = \min_v E_u H^2(X)\, W(X; u, v) . \tag{3.17}$$

We shall call each of the equivalent problems (3.14)-(3.17) the variance min-
imization (VM) problem; and we call the parameter vector ${}_*v$ that minimizes
the programs (3.14)-(3.17) the optimal VM reference parameter vector.

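Example 3.2 (Example 3.1 continued). Consider again estimating $\ell = P_u(X \geq \gamma)
= \exp(-\gamma u^{-1})$. In this case, using the family $\{f(x; v),\ v > 0\}$ defined by
$f(x; v) = v^{-1} \exp(-x v^{-1})$, $x \geq 0$, the program (3.17) reduces to:

$$\min_v V(v) = \min_v \frac{v}{u^2} \int_\gamma^\infty e^{-x (2u^{-1} - v^{-1})}\, dx
= \min_{v > u/2} \frac{v^2}{u}\, \frac{e^{-\gamma (2u^{-1} - v^{-1})}}{2v - u} .$$

Note that this follows directly from (1.40) and Table 1.5. The optimal reference
parameter ${}_*v$ is given by

$${}_*v = \frac{1}{2} \left\{ \gamma + u + \sqrt{\gamma^2 + u^2} \right\} = \gamma + \frac{u}{2} + O(u^2/\gamma) ,$$

where $O(x)$ is a function of x such that

$$\lim_{x \to 0} \frac{O(x)}{x} = \text{constant} .$$

We see that for $\gamma \gg u$, ${}_*v$ is approximately equal to $\gamma$.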
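
As a sanity check on this example, the sketch below evaluates the objective V(v) = (v²/u) exp(-γ(2/u - 1/v)) / (2v - u) on v > u/2 and minimizes it numerically with a golden-section search, a generic one-dimensional method chosen here only for illustration. The numerical minimizer should agree with the closed form ½(γ + u + √(γ² + u²)); the helper names and the values u = 1, γ = 10 are assumptions.

```python
import math

def vm_objective(v, u, gamma):
    """V(v) for the exponential example; finite only for v > u/2."""
    return (v * v / u) * math.exp(-gamma * (2.0 / u - 1.0 / v)) / (2.0 * v - u)

def vm_optimal(u, gamma):
    """Closed-form VM-optimal reference parameter of Example 3.2."""
    return 0.5 * (gamma + u + math.sqrt(gamma * gamma + u * u))

def golden_section_min(fn, lo, hi, tol=1e-9):
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if fn(c) < fn(d):
            b = d          # minimum lies in [a, d]
        else:
            a = c          # minimum lies in [c, b]
    return 0.5 * (a + b)

u, gamma = 1.0, 10.0
v_closed = vm_optimal(u, gamma)
v_numeric = golden_section_min(lambda v: vm_objective(v, u, gamma),
                               u / 2.0 + 1e-6, 10.0 * gamma)
print(v_closed, v_numeric, gamma + u / 2.0)
```

For these values the minimizer is about 10.525, close to γ + u/2 = 10.5.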

Typically, the optimal reference parameter ${}_*v$ that minimizes the programs
(3.14)-(3.17) is not available analytically, since the expected performance
$E_v H(X)$ cannot be evaluated exactly. To overcome this difficulty, we can
replace the expected values in (3.14)-(3.17) by their sample (Monte Carlo)
counterparts and then take the optimal solutions of the associated Monte
Carlo programs as estimators of ${}_*v$. For example, the Monte Carlo counter-
part of (3.17) is

$$\min_v \hat{V}(v) = \min_v \frac{1}{N} \sum_{i=1}^{N} H^2(X_i)\, W(X_i; u, v) , \tag{3.18}$$

where $X_1, \ldots, X_N$ is an i.i.d. sample generated from the original density
$f(\cdot; u)$. This type of procedure has been well studied in the literature, ap-
pearing under various names such as stochastic counterpart method, stochastic
optimization, or the simulated VM program. Under proper assumptions, it is
possible to show that the optimal solution of (3.18) converges to ${}_*v$ as N goes
to infinity and that it is asymptotically normally distributed. See [148] for more
details.

We now borrow from [148] a simple recursive algorithm for estimating the
optimal VM reference parameter ${}_*v$ (see Algorithm 8.4.1 in [148]), which is
based on the stochastic counterpart of (3.16), that is, on

$$\min_v \hat{V}(v) = \min_v \frac{1}{N} \sum_{i=1}^{N} H^2(X_i)\, W(X_i; u, v)\, W(X_i; u, w) , \tag{3.19}$$

where $X_1, \ldots, X_N$ is an i.i.d. sample from $f(\cdot; w)$, for any w.



Algorithm 3.2.1 (VM Algorithm)

1. Choose an initial tilting vector $v_{(0)}$, for example $v_{(0)} = u$. Set k = 1.

2. Generate a random sample $X_1, \ldots, X_N$ from $f(\cdot; v_{(k-1)})$.

3. Solve the stochastic program (3.19). Denote the solution by $v_{(k)}$; thus

$$v_{(k)} = \mathop{\mathrm{argmin}}_v \frac{1}{N} \sum_{i=1}^{N} H^2(X_i)\, W(X_i; u, v)\, W(X_i; u, v_{(k-1)}) . \tag{3.20}$$

4. If convergence is reached, say at k = K, then stop; otherwise increase k
by 1 and reiterate from Step 2.



Remark 3.3 (Stopping Criterion). A possible stopping criterion is to stop when
all the parameters have ceased to increase or decrease monotonically. More
precisely, for the j-th component of v let $K_j$ be the iteration at which the
sequence $v_{(0),j}, v_{(1),j}, v_{(2),j}, \ldots$ starts “fluctuating,” that is, $K_j$ is such that

either

$$v_{(0),j} < v_{(1),j} < v_{(2),j} < \cdots < v_{(K_j - 1),j} > v_{(K_j),j}$$

or

$$v_{(0),j} > v_{(1),j} > v_{(2),j} > \cdots > v_{(K_j - 1),j} < v_{(K_j),j} .$$

Now stop at the iteration

$$K = \max_j K_j .$$
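
The stopping rule of Remark 3.3 can be sketched in a few lines of Python. The function names, the list-of-lists layout of the parameter history, and the convention that a step only counts as a reversal when its sign is strictly opposite to the initial direction are assumptions of this sketch, not part of the remark.

```python
def first_fluctuation(seq):
    """Return K_j, the first index at which seq reverses direction, or None."""
    direction = 0
    for k in range(1, len(seq)):
        step = seq[k] - seq[k - 1]
        if direction == 0:
            direction = (step > 0) - (step < 0)   # direction of the first move
        elif step * direction < 0:
            return k                              # monotone run is broken here
    return None

def stopping_iteration(history):
    """history[k][j] is component j at iteration k; return K = max_j K_j or None."""
    n_components = len(history[0])
    ks = [first_fluctuation([row[j] for row in history])
          for j in range(n_components)]
    if any(k is None for k in ks):
        return None                               # some component is still monotone
    return max(ks)

history = [[1.0, 5.0], [2.0, 4.0], [3.0, 4.5], [2.5, 4.2]]
print(stopping_iteration(history))
```

In this small history the first component turns at iteration 3 and the second at iteration 2, so the rule stops at K = 3.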







3.3 Kullback-Leibler Cross-Entropy 

An alternative to variance minimization for estimating the optimal reference
parameter vector in (3.13) is based on the Kullback-Leibler cross-entropy,
which defines a “distance” between the two pdfs g and h and can be written
(see also Section 1.5) as

$$D(g, h) = \int g(x) \ln \frac{g(x)}{h(x)}\, \mu(dx)
= \int g(x) \ln g(x)\, \mu(dx) - \int g(x) \ln h(x)\, \mu(dx) . \tag{3.21}$$

Recall from Section 1.5 that $D(g, h) \geq 0$, with equality if and only if g = h
(on a set of $\mu$-measure 1). The idea behind the CE method is to choose the
IS density h such that the Kullback-Leibler distance between the optimal
importance sampling density $g^*$ in (3.10) and h is minimal, that is, the CE
optimal IS density $h^*$ is the solution to the functional optimization program

$$\min_h D(g^*, h) .$$

But, it is immediate from $D(g^*, h) \geq 0$ that $h^* = g^*$. Hence, if we optimize
over all densities h, the VM and CE optimal IS densities coincide.

However, when using the SLR approach, the class of densities is restricted
to some family $\{f(\cdot; v),\ v \in \mathcal{V}\}$ that contains the nominal density $f(\cdot; u)$. The
CE method now aims to solve the parametric optimization problem

$$\min_v D(g^*, f(\cdot; v)) ,$$

with

$$g^*(x) = c^{-1}\, |H(x)|\, f(x; u) ,$$

where $c = \int |H(x)|\, f(x; u)\, \mu(dx)$. Since the first term in the right-hand side
of (3.21) does not depend on v, minimizing the Kullback-Leibler distance
between $g^*$ above and $f(\cdot; v)$ is equivalent to maximizing, with respect to v,

$$\int |H(x)|\, f(x; u) \ln f(x; v)\, \mu(dx) = E_u |H(X)| \ln f(X; v) .$$

Let us now assume for simplicity that $H(X) \geq 0$, thereby dropping the
absolute signs in the formulas above. We find that the optimal reference pa-
rameter (with respect to Kullback-Leibler distance) is the solution to

$$\max_v D(v) = \max_v E_u H(X) \ln f(X; v) . \tag{3.22}$$

This is equivalent to the program

$$\max_v D(v) = \max_v E_w H(X)\, W(X; u, w) \ln f(X; v) , \tag{3.23}$$

for any tilting parameter w, where $W(X; u, w)$ is again the likelihood ratio
given in (3.12).

In analogy to the VM program (3.17) we call (3.23) the cross-entropy (CE)
program; and we call the parameter vector $v^*$ that maximizes the program
(3.23) the optimal CE reference parameter vector.

Similar to (3.19), we can estimate the optimal solution $v^*$ by the optimal
solution of the program

$$\max_v \hat{D}(v) = \max_v \frac{1}{N} \sum_{i=1}^{N} H(X_i)\, W(X_i; u, w) \ln f(X_i; v) , \tag{3.24}$$

where $X_1, \ldots, X_N$ is a random sample from $f(\cdot; w)$. We shall call the program
(3.24) the stochastic counterpart of the CE program (3.23) or simply the
simulated CE program, to distinguish it from the VM counterpart (3.19). In
typical applications the function $\hat{D}$ in (3.24) is concave and differentiable with
respect to v [149] and, thus, the solution of (3.24) may be readily obtained
by solving (with respect to v) the following system of equations:

$$\frac{1}{N} \sum_{i=1}^{N} H(X_i)\, W(X_i; u, w)\, \nabla \ln f(X_i; v) = 0 , \tag{3.25}$$

where the gradient is with respect to v. Similarly, the solution to (3.23) may
be obtained by solving

$$E_u H(X)\, \nabla \ln f(X; v) = 0 , \tag{3.26}$$

provided the expectation and differentiation operators can be interchanged
[149]. Note that the function $v \mapsto \nabla \ln f(x; v)$ is the score function, see (1.27).
In particular, for exponential families (1.3), with $\theta = \psi(u)$ for some parame-
terization $\psi$, the solution to (3.23) satisfies, by (1.28),

$$E_u H(X) \left( \frac{\nabla c(\eta)}{c(\eta)} + t(X) \right) = 0 , \tag{3.27}$$

where $\eta = \psi(v)$. It is interesting to compare formulas (3.25) and (3.26) with
similar formulas such as (1.39) and (1.38) of the score function method. We see
for example that for v = u the left-hand side of (3.26) is simply the gradient
of the expected performance: $\nabla \ell(u) = \nabla E_u H(X)$. A notable difference in
the estimator in (3.25) for the CE method and (1.39) for the score function
method is that in the former case u is fixed and v variable, whereas in the
latter case u is variable and v is fixed.

We shall consider the CE programs (3.23) and (3.24) as alternatives to 
their VM counterparts (3.16) and (3.19), respectively and shall show that 
their optimal solutions are nearly the same. This leads to the following CE 
analog of Algorithm 3.2.1: 







Algorithm 3.3.1 (CE Algorithm)

1. Choose an initial tilting vector $v_{(0)}$, for example $v_{(0)} = u$. Set k = 1.

2. Generate a random sample $X_1, \ldots, X_N$ from $f(\cdot; v_{(k-1)})$.

3. Solve the simulated cross-entropy program (3.24). Denote the optimal so-
lution by $v_{(k)}$; thus

$$v_{(k)} = \mathop{\mathrm{argmax}}_v \frac{1}{N} \sum_{i=1}^{N} H(X_i)\, W(X_i; u, v_{(k-1)}) \ln f(X_i; v) . \tag{3.28}$$

4. If convergence is reached, say at k = K, then stop; otherwise increase k
by 1 and reiterate from Step 2.

We could use the same stopping criterion as in Remark 3.3. 

We shall show that optimal solutions of the CE programs (3.23) and (3.24)
and their VM counterparts (3.16) and (3.19) are nearly the same. The advan-
tage of Algorithm 3.3.1 as compared to Algorithm 3.2.1 is that $v_{(k)}$ in (3.28)
can often be calculated analytically. In particular, this happens if the distribu-
tions of the random variables belong to the natural exponential family (NEF)
or if f is a discrete pdf with finite support.

Note that for the VM programs there is typically no analytic solution even 
for NEF distributions. Thus, numerical optimization procedures must be used 
in such cases. This emphasizes a big advantage of the CE approach over its VM 
counterpart. The following two examples deal with NEF and finite support 
discrete distributions, respectively. 

Example 3.4 (Example 3.2 continued). Consider again estimation of $\ell =
P_u(X \geq \gamma) = \exp(-\gamma u^{-1})$. In this case, program (3.22) becomes:

$$\max_v D(v) = \max_v \int_\gamma^\infty u^{-1} e^{-x u^{-1}} \ln\!\left( v^{-1} e^{-x v^{-1}} \right) dx
= \max_v \int_\gamma^\infty u^{-1} e^{-x u^{-1}} \left( -\ln v - x v^{-1} \right) dx
= -\min_v \frac{e^{-\gamma u^{-1}} (\gamma + u + v \ln v)}{v} .$$

It is easily verified that the optimal CE reference parameter is $v^* = \gamma + u$. As in
the VM case, for $\gamma \gg u$ the optimal reference parameter is approximately $\gamma$.

Example 3.5 (NEF distributions). Suppose that the univariate random vari-
able X belongs to a NEF of distributions parameterized by the mean v. Thus
(see Section 1.3) the density of X belongs to a class $\{f(x; v)\}$ with

$$f(x; v) = e^{x\, \theta(v) - \zeta(\theta(v))}\, h(x) ,$$

with $E_v X = \zeta'(\theta(v)) = v$ and $\mathrm{Var}_v(X) = \zeta''(\theta(v))$. Specifically, let $X \sim
f(x; u)$ for some nominal reference parameter u. For simplicity assume that
$E_u H(X) > 0$ and that X is nonconstant. Writing $\theta = \theta(v)$ we have from (3.26)
and (1.28) that the optimal reference parameter is given by the solution of

$$E_u H(X) \left( \frac{d \ln c(\theta)}{d\theta} + X \right) = E_u H(X)\, (-v + X) = 0 ,$$

where we have used the fact that $c(\theta) = e^{-\zeta(\theta)}$. Hence $v^*$ is given by

$$v^* = \frac{E_u H(X)\, X}{E_u H(X)} = \frac{E_w W(X; u, w)\, H(X)\, X}{E_w H(X)\, W(X; u, w)} , \tag{3.29}$$

for any reference parameter w. It is not difficult to check that $v^*$ is indeed a
unique global maximum of $D(v) = E_u H(X) \ln f(X; v)$.

The estimate $\hat{v}$ of $v^*$ in (3.29) can be obtained analytically from the solu-
tion of the stochastic program (3.24), that is,

$$\hat{v} = \frac{\sum_{i=1}^{N} H(X_i)\, W(X_i; u, w)\, X_i}{\sum_{i=1}^{N} H(X_i)\, W(X_i; u, w)} , \tag{3.30}$$

where $X_1, \ldots, X_N$ is a random sample from the density $f(\cdot; w)$.

A similar explicit formula can be found for the case where $X = (X_1, \ldots,
X_n)$ is a vector of independent random variables such that each component $X_j$
belongs to a NEF parameterized by the mean. In particular, if $u = (u_1, \ldots, u_n)$
is the nominal reference parameter, then for each $j = 1, \ldots, n$ the density of
$X_j$ is given by

$$f_j(x; u_j) = e^{x\, \theta_j(u_j) - \zeta_j(\theta_j(u_j))}\, h_j(x) .$$

It is easy to see that problem (3.23) under the independence assumption
becomes “separable,” that is, it reduces to n subproblems of the form above.
Thus, we find that the optimal reference parameter vector $v^* = (v_1^*, \ldots, v_n^*)$
is given as

$$v_j^* = \frac{E_u H(X)\, X_j}{E_u H(X)} = \frac{E_w H(X)\, W(X; u, w)\, X_j}{E_w H(X)\, W(X; u, w)} . \tag{3.31}$$

Moreover, we can estimate the j-th component of $v^*$ as

$$\hat{v}_j = \frac{\sum_{i=1}^{N} H(X_i)\, W(X_i; u, w)\, X_{ij}}{\sum_{i=1}^{N} H(X_i)\, W(X_i; u, w)} , \tag{3.32}$$

where $X_1, \ldots, X_N$ is a random sample from the density $f(\cdot; w)$, and $X_{ij}$ is
the j-th component of $X_i$.
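
The analytic update (3.30) is easy to try out on the exponential case of Example 3.4, where H(X) is the indicator of {X ≥ γ} and the optimal CE parameter is known to be γ + u. The following Python sketch uses illustrative names and parameter values; it evaluates the ratio estimator once with w = u and once with a tilted w, and both results should land near γ + u.

```python
import math
import random

def ce_update_exponential(u, w, gamma, n, rng):
    """Estimate v* via (3.30) for X ~ Exp(mean u), sampling from Exp(mean w)."""
    numerator = 0.0
    denominator = 0.0
    for _ in range(n):
        x = rng.expovariate(1.0 / w)                  # X_i ~ f(.; w)
        if x >= gamma:                                # H(X_i) = 1
            # W(x; u, w) = f(x; u) / f(x; w) for exponential densities
            weight = (w / u) * math.exp(-x * (1.0 / u - 1.0 / w))
            numerator += weight * x
            denominator += weight
    if denominator == 0.0:
        raise ValueError("no sample reached the level gamma")
    return numerator / denominator

rng = random.Random(2)
u, gamma = 1.0, 3.0
v_hat_nominal = ce_update_exponential(u, u, gamma, 20_000, rng)    # w = u
v_hat_tilted = ce_update_exponential(u, 4.0, gamma, 20_000, rng)   # w = 4
print(v_hat_nominal, v_hat_tilted, gamma + u)
```

The level γ = 3 is deliberately not very rare, so that sampling under w = u still produces enough nonzero indicators; for a genuinely rare level the first call would raise the error above. For a vector of independent components, (3.32) applies the same ratio to each coordinate with a common weight.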



Example 3.6 (Finite support discrete distributions). Let X be a discrete ran-
dom variable with finite support, that is, X can only take a finite number
of values, say $\{a_1, \ldots, a_m\}$. Let $u_j = P(X = a_j)$, $j = 1, \ldots, m$, and define
$u = (u_1, \ldots, u_m)$. The distribution of X is thus trivially parameterized by the
vector u. We can write the density of X as

$$f(x; u) = \sum_{j=1}^{m} u_j\, I_{\{x = a_j\}} .$$

From the discussion at the beginning of this section we know that the
optimal CE and VM parameters coincide, since we optimize over all densities
on $\{a_1, \ldots, a_m\}$. By (3.10) the VM (and CE) optimal density is given by

$$f(x; v^*) = \frac{H(x)\, f(x; u)}{\sum_x H(x)\, f(x; u)}
= \frac{\sum_{j=1}^{m} H(a_j)\, u_j\, I_{\{x = a_j\}}}{E_u H(X)}
= \sum_{j=1}^{m} \frac{H(a_j)\, u_j}{E_u H(X)}\, I_{\{x = a_j\}} ,$$

so that

$$v_j^* = \frac{E_u H(X)\, I_{\{X = a_j\}}}{E_u H(X)}
= \frac{E_w H(X)\, W(X; u, w)\, I_{\{X = a_j\}}}{E_w H(X)\, W(X; u, w)} , \tag{3.33}$$

for any reference parameter w, provided that $E_w H(X)\, W(X; u, w) > 0$.

The vector $v^*$ can be estimated from the stochastic counterpart of (3.33),
that is, as

$$\hat{v}_j = \frac{\sum_{i=1}^{N} H(X_i)\, W(X_i; u, w)\, I_{\{X_i = a_j\}}}{\sum_{i=1}^{N} H(X_i)\, W(X_i; u, w)} , \tag{3.34}$$

where $X_1, \ldots, X_N$ is an i.i.d. sample from the density $f(\cdot; w)$.

A similar result holds for a random vector $X = (X_1, \ldots, X_n)$ where
$X_1, \ldots, X_n$ are independent discrete random variables with finite support,
characterized by the parameter vectors $u_1, \ldots, u_n$. Because of the indepen-
dence assumption, the CE problem (3.23) separates into n subproblems of the
form above, and all the components of the optimal CE reference parameter
$v^* = (v_1^*, \ldots, v_n^*)$, which is now a vector of vectors, follow from (3.34). Note
that in this case the optimal VM and CE reference parameters are usually not
equal, since we are not optimizing the cross entropy over all densities. See,
however, Proposition 4.2 for an important case where they do coincide and
yield a zero-variance LR estimator.

The updating rule (3.34), which involves discrete finite support distribu-
tions and, in particular, the Bernoulli distribution, will be extensively used
for combinatorial optimization problems in the rest of the book.







Example 3.7. Let X be a random variable having the following discrete dis-
tribution

$$u_i = P(X = i), \quad i = 1, \ldots, m; \qquad \sum_{i=1}^{m} u_i = 1 .$$

Our goal is to estimate $\ell = P(X \geq \gamma)$ using the LR estimator (3.6), with
$H(X) = I_{\{X \geq \gamma\}}$. Assume for simplicity that $\gamma \in \{1, \ldots, m\}$. From Exam-
ple 3.6 we see immediately that both VM and CE optimal reference parame-
ters are given by

$$v_i^* = \begin{cases} 0 & \text{if } 1 \leq i \leq \gamma - 1 , \\[4pt]
\dfrac{u_i}{\sum_{j=\gamma}^{m} u_j} & \text{if } \gamma \leq i \leq m . \end{cases} \tag{3.35}$$

Using this change of measure obviously yields a zero-variance LR estimator
(3.6).



3.4 Estimation of Rare-Event Probabilities 

In this section we shift our attention to the CE method in the specific context
of rare-event simulation. Note that the results of the previous sections have a
more general scope.

Let S(X) again denote the sample performance, where $X \sim f(\cdot; u)$, and
suppose that we wish to estimate $\ell = P_u(S(X) \geq \gamma) = E_u I_{\{S(X) \geq \gamma\}}$, for
some fixed level $\gamma$. Note that this estimation problem presents a particular
case of (3.1) with $H(X) = I_{\{S(X) \geq \gamma\}}$. Assume as before that X has a density
$f(\cdot; u)$ in some family $\{f(\cdot; v)\}$. We can estimate the quantity $\ell$ using the SLR
estimator (see also (3.13))

$$\hat{\ell} = \frac{1}{N} \sum_{i=1}^{N} I_{\{S(X_i) \geq \gamma\}}\, W(X_i; u, v) . \tag{3.36}$$

It is worth noticing that the optimal (zero-variance) change of measure
for this case has an easy interpretation. Namely, from (3.10) we have, with
$A = \{x : S(x) \geq \gamma\}$,

$$g^*(x) = \begin{cases} \dfrac{f(x; u)}{\int_A f(x; u)\, \mu(dx)} , & \text{if } S(x) \geq \gamma , \\[6pt]
0 , & \text{else.} \end{cases}$$

We see that $g^*$ is the conditional density of $X \sim f(\cdot; u)$ given that the event
$\{S(X) \geq \gamma\}$ occurs.

We repeat the main ideas and results of Section 2.3 leading to the main
algorithm of this chapter. An important observation to make is that the simu-
lated CE program (3.24) is of little practical use when rare events are involved,
because most of the indicators $H(X_i) = I_{\{S(X_i) \geq \gamma\}}$ will be zero. For these
problems a two-phase CE procedure is employed in which both the reference
parameter v and the level $\gamma$ are updated, thus avoiding too many indicators
that are zero, and creating a sequence of two-tuples $\{(\gamma_t, v_t)\}$ with the goal
of finding/estimating the optimal CE reference parameter $v^*$.

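Starting with $v_0$ and a $\varrho$ not too small, say $\varrho = 0.01$, the two phases are
as follows:

1. Adaptive updating of $\gamma_t$. For a fixed $v_{t-1}$, let $\gamma_t$ be a $(1 - \varrho)$-quantile
of S(X) under $v_{t-1}$. That is, $\gamma_t$ satisfies

$$P_{v_{t-1}}(S(X) \geq \gamma_t) \geq \varrho , \tag{3.37}$$

$$P_{v_{t-1}}(S(X) \leq \gamma_t) \geq 1 - \varrho , \tag{3.38}$$

where $X \sim f(\cdot; v_{t-1})$. A simple estimator $\hat{\gamma}_t$ of $\gamma_t$ is the order statistic

$$\hat{\gamma}_t = S_{(\lceil (1 - \varrho) N \rceil)} , \tag{3.39}$$

where $S_{(1)} \leq \cdots \leq S_{(N)}$ denote the ordered sample performances.

2. Adaptive updating of $v_t$. For fixed $\gamma_t$ and $v_{t-1}$, derive $v_t$ from the
solution of the following CE program

$$\max_v D(v) = \max_v E_{v_{t-1}} I_{\{S(X) \geq \gamma_t\}}\, W(X; u, v_{t-1}) \ln f(X; v) . \tag{3.40}$$

The stochastic counterpart of (3.40) is as follows: for fixed $\hat{\gamma}_t$ and $\hat{v}_{t-1}$,
derive $\hat{v}_t$ from the following program

$$\max_v \hat{D}(v) = \max_v \frac{1}{N} \sum_{i=1}^{N} I_{\{S(X_i) \geq \hat{\gamma}_t\}}\, W(X_i; u, \hat{v}_{t-1}) \ln f(X_i; v) . \tag{3.41}$$

As seen before, the optimal solutions of (3.40) and (3.41) can often be
obtained analytically, in particular when $f(x; v)$ belongs to a NEF or is a
density with finite support; see Examples 3.5 and 3.6.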

For ease of reference we repeat Algorithm 2.3.1. 

Algorithm 3.4.1 (Main CE Algorithm for Rare-Event Simulation) 

1. Define $\hat{\mathbf{v}}_0 = \mathbf{u}$. Set $t = 1$ (iteration = level counter).

2. Generate a sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ from the density $f(\cdot; \hat{\mathbf{v}}_{t-1})$ and compute
   the sample $(1-\varrho)$-quantile $\hat{\gamma}_t$ of the performances according to (3.39),
   provided $\hat{\gamma}_t$ is less than $\gamma$. Otherwise set $\hat{\gamma}_t = \gamma$.

3. Use the same sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ to solve the stochastic program (3.41).
   Denote the solution by $\hat{\mathbf{v}}_t$.

4. If $\hat{\gamma}_t < \gamma$, set $t = t + 1$ and reiterate from Step 2. Else proceed with Step 5.

5. Estimate the rare-event probability $\ell$ using the SLR estimator $\hat{\ell}$ in (3.36),
   with $\mathbf{v}$ replaced by $\hat{\mathbf{v}}_T$, where $T$ is the final number of iterations (= number
   of levels used).
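The five steps above can be sketched for a one-dimensional toy problem. The following is a minimal illustration, not the authors' code: it assumes $X \sim \mathrm{Exp}$ with mean $u$, $S(x) = x$, and uses the closed-form exponential-family solution of (3.41) (a weighted mean of the elite samples). The function name, the default sample sizes, and the seed are all hypothetical choices made for this sketch.

```python
import math
import random

def ce_rare_event_exp(gamma, u=1.0, rho=0.1, n_pilot=2000, n_final=50000, seed=1):
    """Sketch of Algorithm 3.4.1 for l = P(X >= gamma), X ~ Exp(mean u), S(x) = x."""
    rng = random.Random(seed)

    def lik_ratio(x, v):
        # W(x; u, v) = f(x; u) / f(x; v) for exponential densities with means u and v
        return (v / u) * math.exp(-x * (1.0 / u - 1.0 / v))

    v, levels = u, []                      # Step 1: v_0 = u
    while True:
        # Step 2: sample under v_{t-1}, take the sample (1 - rho)-quantile, cap at gamma
        xs = sorted(rng.expovariate(1.0 / v) for _ in range(n_pilot))
        g_t = min(gamma, xs[math.ceil((1 - rho) * n_pilot) - 1])
        # Step 3: closed-form solution of (3.41), using weights under the old v
        elite = [x for x in xs if x >= g_t]
        w = [lik_ratio(x, v) for x in elite]
        v = sum(wi * x for wi, x in zip(w, elite)) / sum(w)
        levels.append((g_t, v))
        if g_t >= gamma:                   # Step 4: stop once level gamma is reached
            break

    # Step 5: SLR estimator with the final reference parameter
    total = 0.0
    for _ in range(n_final):
        x = rng.expovariate(1.0 / v)
        if x >= gamma:
            total += lik_ratio(x, v)
    return total / n_final, levels
```

For example, `ce_rare_event_exp(16.0)` returns an estimate of $e^{-16}$ together with the sequence of pairs $(\hat{\gamma}_t, \hat{v}_t)$; the pilot sample size plays the role of $N$ in Step 2 and `n_final` that of the larger final sample in Step 5.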







Remark 3.8. In typical applications a much smaller sample size $N$ in Step 2
can be chosen than the final sample size in Step 5. When we need to distinguish
between the two sample sizes, in particular when reporting numerical
experiments, we will use the notation $N$ and $N_1$ for Steps 2 and 5, respectively.

Remark 3.9. To obtain a more accurate estimate of $\mathbf{v}^*$ it is sometimes useful,
especially when the sample size is relatively small, to repeat Steps 2-4 for a
number of additional iterations after level $\gamma$ has been reached.

Note that Algorithm 3.4.1 breaks down the "hard" problem of estimating
the very small probability $\ell$ into a sequence of "simple" problems, each time
generating a sequence of pairs $\{(\hat{\gamma}_t, \hat{\mathbf{v}}_t)\}$ depending on $\varrho$, which is called the
rarity parameter.

Remark 3.10. It is important to understand the differences between Algorithm
3.3.1 and Algorithm 3.4.1. The major difference is that Algorithm 3.4.1
estimates the optimal reference parameters for multiple levels $\{\gamma_t\}$ whereas
in Algorithm 3.3.1 only one level $\gamma$ is used. With this in mind, we call
Algorithm 3.4.1 a multilevel procedure (where the levels are indexed by $t$) and
Algorithm 3.3.1 a single-level procedure. At the $t$-th level in Algorithm 3.4.1
we have to estimate (from the same sample) first $\gamma_t$ and then $\mathbf{v}_t$.

We also present the deterministic version of Algorithm 3.4.1. 

Algorithm 3.4.2 (Deterministic Version of the CE Algorithm) 

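1. Define $\mathbf{v}_0 = \mathbf{u}$. Set $t = 1$.

2. Calculate $\gamma_t$ as

   $$\gamma_t = \max\{s : P_{\mathbf{v}_{t-1}}(S(\mathbf{X}) \ge s) \ge \varrho\}, \tag{3.42}$$

   provided $\gamma_t < \gamma$; otherwise set $\gamma_t = \gamma$.

3. Calculate $\mathbf{v}_t$ (see (3.40)) as

   $$\mathbf{v}_t = \operatorname*{argmax}_{\mathbf{v}} E_{\mathbf{v}_{t-1}} I_{\{S(\mathbf{X}) \ge \gamma_t\}} W(\mathbf{X}; \mathbf{u}, \mathbf{v}_{t-1}) \ln f(\mathbf{X}; \mathbf{v}). \tag{3.43}$$

4. If $\gamma_t = \gamma$, then stop; otherwise set $t = t + 1$ and reiterate from Step 2.

Note that, as compared with Algorithm 3.4.1, Step 5 is redundant in
Algorithm 3.4.2.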

It is not a simple matter to see whether or not Algorithm 3.4.2 and Algo- 
rithm 3.4.1 will reach the desired value 7 say after a finite number of iterations, 
provided £. 







3.5 Examples 

For better insight into the single-level Algorithm 3.3.1 (updating only the
reference vector) and the multiple-level Algorithm 3.4.1 (updating the reference
vector and the level) we now present several examples. Although in some of
those examples the quantities of interest can be computed analytically, we
present them to illustrate the algorithms. In particular, we show that the
optimal reference parameter vectors ${}_*\mathbf{v}$ and $\mathbf{v}^*$ of the VM and CE programs are
nearly the same. We start with a simple example of the deterministic
multilevel Algorithm 3.4.2.

Example 3.11. Let $X \sim \mathrm{Exp}(1)$. Suppose we want to estimate the probability
$P(X \ge \gamma)$ (with reference distributions $\mathrm{Exp}(v^{-1})$) for large $\gamma$, say
$\gamma = (16, 64, 256)$ (corresponding to $\ell \approx (10^{-7}, 10^{-28}, 10^{-111})$, respectively).
From (3.42) it is obvious that $\gamma_t = \min(\gamma, -v_{t-1} \ln \varrho)$, and from Example 3.4
we find $v_t = \gamma_t + 1$. If we take $\varrho = 0.02$ then it is readily seen that the
deterministic Algorithm 3.4.2 terminates at iteration $T = 2$, 3 and 4, respectively.
And if we take $\varrho = 0.001$ and the same values of $\gamma$, then Algorithm 3.4.2
terminates at iteration $T = 2$, 3 and 3, respectively. Table 3.1 presents the
evolution of $\gamma_t$ and $v_t$ for fixed $\gamma = 64$ and $\varrho = 0.02$ and $\varrho = 0.001$, respectively.



Table 3.1. The evolution of $\gamma_t$ and $v_t$ for $\gamma = 64$, with $\varrho = 0.02$ and $\varrho = 0.001$.

          varrho = 0.02            varrho = 0.001
  t     gamma_t     v_t          gamma_t     v_t
  0        -        1               -        1
  1      3.912     4.912          6.907     7.907
  2     19.215    20.215         54.625    55.625
  3     64        65             64        65



It follows from Table 3.1 that for $\gamma = 64$ in both cases Algorithm 3.4.2
terminates after three iterations ($\gamma_T = \gamma_3 = 64$). Figure 3.1 illustrates the
three-level procedure.
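The recursion of Example 3.11 is simple enough to run directly. The sketch below (a hypothetical helper, not from the text) iterates $\gamma_t = \min(\gamma, -v_{t-1}\ln\varrho)$ and $v_t = \gamma_t + 1$ for $X \sim \mathrm{Exp}(1)$ and records the pairs $(\gamma_t, v_t)$ until level $\gamma$ is reached.

```python
import math

def deterministic_ce_exp(gamma, rho):
    """Deterministic Algorithm 3.4.2 for X ~ Exp(1): returns the list of (gamma_t, v_t)."""
    v, path = 1.0, []                       # v_0 = u = 1
    while True:
        g = min(gamma, -v * math.log(rho))  # level update, cf. (3.42)
        v = g + 1.0                         # reference parameter update, v_t = gamma_t + 1
        path.append((g, v))
        if g == gamma:                      # stop once the target level is reached
            return path

for rho in (0.02, 0.001):
    # number of iterations T for gamma = 16, 64, 256
    print(rho, [len(deterministic_ce_exp(g, rho)) for g in (16, 64, 256)])
# prints: 0.02 [2, 3, 4]
#         0.001 [2, 3, 3]
```

Calling `deterministic_ce_exp(64, 0.02)` gives the left half of Table 3.1 up to rounding in the third decimal.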



Example 3.12 (Minimum of exponential random variables). Suppose we are
interested in estimating $\ell = \ell(\gamma) = P(S(\mathbf{X}) \ge \gamma)$, where

$$S(\mathbf{X}) = \min(X_1, \ldots, X_n) \tag{3.44}$$

and the random variables $X_1, \ldots, X_n$ are i.i.d. and exponentially distributed
with mean $u$; thus each $X_i$ has density $u^{-1} \exp(-x u^{-1})$, $x \ge 0$.

Obviously,

$$\ell = \prod_{i=1}^{n} P(X_i \ge \gamma) = e^{-n \gamma u^{-1}}. \tag{3.45}$$

Fig. 3.1. A three-level realization of Algorithm 3.4.2. All shaded regions have area
$\varrho$. The density of $f(\cdot; v_3)$ has been omitted.

For large $\gamma$, the squared coefficient of variation (SCV) of the crude Monte
Carlo (CMC) estimator (see (1.10)) is

$$\kappa^2(\gamma) = \frac{1 - \ell}{N \ell} \approx \frac{1}{N} e^{n \gamma u^{-1}}.$$

Hence the CMC estimator has exponential complexity in $\gamma$.

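VM Approach. It is easy to verify from (1.40) and Table 1.5 that for i.i.d.
and exponentially distributed random variables we have for any performance
function $H$ that

$$E_u H^2(\mathbf{X}) W(\mathbf{X}; u, v) = \left( \frac{v^2}{u(2v - u)} \right)^{n} E_w H^2(\mathbf{X}), \qquad w = \frac{uv}{2v - u}. \tag{3.46}$$

It follows (see also Example 3.2) that the variance of the estimator (3.36) is

$$\mathrm{Var}(\hat{\ell}) = \frac{1}{N} \left\{ \left( \frac{v^2}{u(2v - u)} \right)^{n} e^{-n\gamma(2u^{-1} - v^{-1})} - e^{-2n\gamma u^{-1}} \right\}.$$

Consequently, the SCV of $\hat{\ell}$ is given by

$$\kappa^2 = \frac{1}{N} \left\{ \left( \frac{v^2 e^{\gamma/v}}{u(2v - u)} \right)^{n} - 1 \right\}.$$

As in Example 3.2 we can readily obtain that under the variance minimization
criterion the optimal reference parameter is given by

$${}_*v = \tfrac{1}{2} \left\{ \gamma + u + \sqrt{\gamma^2 + u^2} \right\} = \gamma + \frac{u}{2} + O(u^2/\gamma).$$

CE Approach. Since the distribution of $\mathbf{X}$ lies in an exponential family of
the form (1.3), with $\theta = v^{-1}$ and $t(\mathbf{x}) = -\sum_{i} x_i$, we obtain
directly from (3.27) that $v^*$ satisfies

$$E_u I_{\{S(\mathbf{X}) \ge \gamma\}} \left( \frac{n}{\eta} - \sum_{i=1}^{n} X_i \right) = 0,$$

with $\eta = 1/v^*$. It follows that

$$v^* = E_u \left[ \frac{1}{n} \sum_{i=1}^{n} X_i \,\Big|\, S(\mathbf{X}) \ge \gamma \right] = u + \gamma. \tag{3.47}$$

For large $\gamma \gg u$ we have ${}_*v \approx v^* \approx \gamma$, so that for both VM and CE approaches

$$\kappa^2(\gamma) \approx \frac{1}{N} \gamma^n e^n (2u)^{-n}, \tag{3.48}$$

where $N$ is the sample size. That is, for large $\gamma$, the SCV $\kappa^2(\gamma)$ of the CMC
estimator increases exponentially in $\gamma$, whereas that of both the VM and CE
optimal LR estimators increases only polynomially. In other words, the CMC
estimator has exponential complexity and both optimal VM and CE estimators
have polynomial complexity.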
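To see numerically how close the VM and CE reference parameters of Example 3.12 are, one can evaluate $N\kappa^2 = \left(v^2 e^{\gamma/v}/(u(2v-u))\right)^n - 1$ at both. The helper names below are hypothetical, and the sketch assumes the closed forms for the VM-optimal and CE-optimal parameters stated in this example.

```python
import math

def scv_times_n(v, gamma, u, n):
    """N * kappa^2 of the SLR estimator of P(min X_i >= gamma) with reference mean v > u/2."""
    return (v * v * math.exp(gamma / v) / (u * (2.0 * v - u))) ** n - 1.0

def vm_optimal(gamma, u):
    """Variance-minimizing reference mean: (gamma + u + sqrt(gamma^2 + u^2)) / 2."""
    return 0.5 * (gamma + u + math.sqrt(gamma * gamma + u * u))

def ce_optimal(gamma, u):
    """CE-optimal reference mean, cf. (3.47)."""
    return gamma + u

gamma, u, n = 20.0, 1.0, 3
print(scv_times_n(vm_optimal(gamma, u), gamma, u, n))   # VM optimum
print(scv_times_n(ce_optimal(gamma, u), gamma, u, n))   # CE optimum, marginally larger
print(math.exp(n * gamma / u))                          # CMC: N * kappa^2 is roughly 1 / l
```

With these (arbitrarily chosen) values the two SLR figures differ by well under one percent, while the CMC figure is larger by many orders of magnitude.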



Example 3.13 (Sum of exponential random variables). Suppose we are
interested in estimating

$$\ell(\gamma) = P(S(\mathbf{X}) \ge \gamma) = P(X_1 + \cdots + X_n \ge \gamma), \tag{3.49}$$

where the random variables $X_1, \ldots, X_n$ are i.i.d. and exponentially
distributed with mean $u$. Of course, in that case we know that $S(\mathbf{X})$ has a
Gamma distribution with parameters $n$ and $u^{-1}$, so $P(S(\mathbf{X}) \ge \gamma)$ can be
computed exactly as

$$\ell(\gamma) = P_u(S(\mathbf{X}) \ge \gamma) = \sum_{k=0}^{n-1} \frac{e^{-\gamma u^{-1}} (\gamma u^{-1})^k}{k!}.$$

It is difficult to compute the VM-optimal parameter in this case. However, we
can compute the CE-optimal parameter $\mathbf{v}^*$ given by (3.31). Since $X_1, \ldots, X_n$
are i.i.d. it is clear that all components of $\mathbf{v}^*$ are identical. After some algebra,
we obtain that

$$v^* = u \, \frac{P_u(X_1 + \cdots + X_{n+1} \ge \gamma)}{P_u(X_1 + \cdots + X_n \ge \gamma)} = u \, \frac{\sum_{k=0}^{n} (\gamma u^{-1})^k / k!}{\sum_{k=0}^{n-1} (\gamma u^{-1})^k / k!}, \tag{3.50}$$

where $X_{n+1}$ is independent of $X_1, \ldots, X_n$ and has the same distribution as
$X_1, \ldots, X_n$. From the above expression, it is easy to see that, when $\gamma/u$ is
large (that is, $\gamma \gg u$), we have

$$v^* \approx (u + \gamma)/n \approx \gamma/n. \tag{3.51}$$

Let us consider the asymptotic properties of the SCV of the LR estimator $\hat{\ell}$
in (3.6), as $\gamma \to \infty$. Similar to the previous example we have

$$N \times \kappa^2(\gamma) = \left( \frac{v^2}{u(2v - u)} \right)^{n} \frac{P_w(S(\mathbf{X}) \ge \gamma)}{\ell^2(\gamma)} - 1, \qquad w = \frac{uv}{2v - u}, \tag{3.52}$$

where we have used (3.46). For large $\gamma$ and fixed $w$ the dominating term in
$P_w(X_1 + \cdots + X_n \ge \gamma)$ is $(\gamma w^{-1})^{n-1} \exp(-\gamma w^{-1})/(n-1)!$. We can use this
to show that for reference parameter $v = \gamma/n$, we have

$$N \kappa^2(\gamma) \approx \frac{e^n (n-1)!}{n^n} \left\{ \frac{\gamma}{2u} - \frac{2n - 3}{4} \right\} - 1$$

as $\gamma \to \infty$. That is, the SCV of the optimal LR estimator grows linearly
with respect to $\gamma$. Notice that for $v = u$, which corresponds to the CMC
estimator, the SCV increases exponentially in $\gamma$.
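The CE-optimal parameter for the sum of exponentials can be evaluated exactly, which makes it easy to check the large-$\gamma$ approximation $v^* \approx (u+\gamma)/n$. The sketch below assumes the representation $v^* = u\,P_u(X_1+\cdots+X_{n+1}\ge\gamma)/P_u(X_1+\cdots+X_n\ge\gamma)$ and uses the standard identity $P(X_1+\cdots+X_n \ge \gamma) = P(\mathrm{Poisson}(\gamma/u) \le n-1)$; the function names are hypothetical.

```python
import math

def poisson_cdf(k, x):
    """P(Poisson(x) <= k) = sum_{j=0}^{k} e^{-x} x^j / j!."""
    term, total = math.exp(-x), 0.0
    for j in range(k + 1):
        total += term
        term *= x / (j + 1)
    return total

def ce_optimal_sum(gamma, u, n):
    """Exact CE-optimal mean for S(X) = X_1 + ... + X_n with X_i ~ Exp(mean u)."""
    x = gamma / u
    return u * poisson_cdf(n, x) / poisson_cdf(n - 1, x)

# Exact value versus the approximation (u + gamma) / n for gamma >> u
print(ce_optimal_sum(100.0, 1.0, 5), (1.0 + 100.0) / 5)
```

For $n = 1$ the formula reduces to $v^* = u + \gamma$, in agreement with the single-variable case of Example 3.12.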



Example 3.14 (Heavy tails). The CE method is not limited to light-tail
distributions. We can also apply it to heavy-tail distributions. To illustrate this, we
generalize Example 3.12 for $n = 1$ to the Weibull case. Specifically, consider
estimation of $\ell = P(X \ge \gamma)$ with $X \sim \mathrm{Weib}(a, u^{-1})$, $0 < a < 1$. That is, $X$
has density

$$f(x; u) = a u^{-1} (u^{-1} x)^{a-1} e^{-(u^{-1} x)^a}, \quad x > 0. \tag{3.53}$$

To estimate $\ell$ via the CE method we shall use the family of distributions
$\{\mathrm{Weib}(a, v^{-1}), v > 0\}$, where $a$ is kept fixed. Note that for $a = 1$ we have the
class of exponential distributions.

Using the CE approach, we find the optimal CE reference parameter by
solving

$$\max_v D(v) = \max_v \int_{\gamma}^{\infty} f(x; u) \ln f(x; v) \, dx,$$

or, equivalently, by solving

$$\int_{\gamma}^{\infty} f(x; u) \frac{d}{dv} \ln f(x; v) \, dx = 0. \tag{3.54}$$

Substituting (3.53) into (3.54) yields the following simple expression for the
optimal CE reference parameter $v^*$:

$$v^* = (u^a + \gamma^a)^{1/a}. \tag{3.55}$$

This is true for any $a > 0$. Note that $\{\mathrm{Weib}(a, v^{-1}), v > 0\}$ is an exponential
family of the form (1.3), with $t(x) = -x^a$, $\theta = v^{-a}$, $c(\theta) = \theta$ and
$h(x) = a x^{a-1}$. So we can obtain (3.55) similar to (3.47) as the solution to

$$E_u I_{\{X \ge \gamma\}} \left( \frac{1}{\eta} - X^a \right) = 0, \tag{3.56}$$

with $\eta = v^{-a}$. It follows that the optimal $\eta^*$ is given by $1/E_u[X^a \mid X \ge \gamma]$, or
equivalently

$$v^* = \left( E_u[X^a \mid X \ge \gamma] \right)^{1/a}. \tag{3.57}$$



Similar to Example 3.12 the variance of the SLR estimator $\hat{\ell}$ for any
reference parameter $v$ can be easily found from (1.40) and Table 1.5. Namely,
using the exponential family representation above we have

$$\mathrm{Var}(\hat{\ell}) = \frac{1}{N} \left\{ E_u I_{\{X \ge \gamma\}} W(X; u, v) - \ell^2 \right\}
= \frac{1}{N} \left\{ \frac{e^{-(2u^{-a} - v^{-a}) \gamma^a}}{(u/v)^a \left( 2 - (u/v)^a \right)} - \ell^2 \right\}.$$

If we substitute $v$ with $v^*$ and divide by $\ell^2$, we find that the SCV of $\hat{\ell}$ is
given by

$$\kappa^2 = \frac{1}{N} \left\{ \frac{\left( (\gamma/u)^a + 1 \right)^2 \exp\left( \frac{(\gamma/u)^a}{(\gamma/u)^a + 1} \right)}{2 (\gamma/u)^a + 1} - 1 \right\}.$$

It follows that for large $\gamma/u$

$$\kappa^2 \approx \frac{1}{N} \, \frac{e}{2} \left( \frac{\gamma}{u} \right)^{a}.$$

In other words, the SLR estimator $\hat{\ell}$ has polynomial complexity in $\gamma$, for
any $a > 0$, including the heavy-tail case $0 < a < 1$. It is a common
misunderstanding that IS only works for light-tail distributions. In this example we saw
that polynomial complexity can be easily obtained by using the CE method.
But we can do even better. In Section 3.8 we will see how with the TLR
method we can in fact achieve an SLR estimator with bounded relative error,
meaning that $\kappa^2$ is bounded by $c/N$ for some constant $c$ which does not
depend on $\gamma$.
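The polynomial growth of the SCV in the Weibull example can be checked numerically. The following sketch (hypothetical function names) evaluates $N\kappa^2$ at the CE-optimal parameter $v^* = (u^a + \gamma^a)^{1/a}$, written in terms of $c = (\gamma/u)^a$, and compares it with the CMC value $(1-\ell)/\ell$, where $\ell = e^{-(\gamma/u)^a}$.

```python
import math

def weibull_ce_parameter(gamma, u, a):
    """CE-optimal scale v* = (u^a + gamma^a)^(1/a), cf. (3.55)."""
    return (u ** a + gamma ** a) ** (1.0 / a)

def weibull_scv_times_n(gamma, u, a):
    """N * kappa^2 of the SLR estimator at v*, with c = (gamma/u)^a."""
    c = (gamma / u) ** a
    return (1.0 + c) ** 2 * math.exp(c / (1.0 + c)) / (2.0 * c + 1.0) - 1.0

def cmc_scv_times_n(gamma, u, a):
    """N * kappa^2 of crude Monte Carlo: (1 - l) / l with l = exp(-(gamma/u)^a)."""
    ell = math.exp(-((gamma / u) ** a))
    return (1.0 - ell) / ell

# Heavy-tail case a = 0.5: SLR value is close to (e/2) * (gamma/u)^a, CMC value is about e^100
print(weibull_scv_times_n(1e4, 1.0, 0.5), (math.e / 2) * 100.0, cmc_scv_times_n(1e4, 1.0, 0.5))
```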

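Remark 3.15 (Weibull random variables). As a generalization of Example 3.14
consider the estimation of $\ell = P(S(\mathbf{X}) \ge \gamma)$ where the components $X_i$,
$i = 1, \ldots, n$, of $\mathbf{X}$ are independent and $X_i \sim \mathrm{Weib}(a_i, u_i^{-1})$, $i = 1, \ldots, n$.
Consider the change of measure where the $X_i$ are still independent but now
$\mathrm{Weib}(a_i, v_i^{-1})$ distributed, for some $v_i > 0$. We write this symbolically as

$$X_i \sim \mathrm{Weib}(a_i, u_i^{-1}) \longrightarrow \mathrm{Weib}(a_i, v_i^{-1}).$$

It is readily seen from (3.57) that for fixed $a_i$, $i = 1, \ldots, n$, programs
(3.40) and (3.41) can be solved analytically, and the components of
$\mathbf{v}_t = (v_{t,1}, \ldots, v_{t,n})$ and $\hat{\mathbf{v}}_t = (\hat{v}_{t,1}, \ldots, \hat{v}_{t,n})$ in the Weibull pdf are

$$v_{t,i} = \left( \frac{E_{\mathbf{v}_{t-1}} I_{\{S(\mathbf{X}) \ge \gamma_t\}} W(\mathbf{X}; \mathbf{u}, \mathbf{v}_{t-1}) X_i^{a_i}}{E_{\mathbf{v}_{t-1}} I_{\{S(\mathbf{X}) \ge \gamma_t\}} W(\mathbf{X}; \mathbf{u}, \mathbf{v}_{t-1})} \right)^{1/a_i} \tag{3.58}$$

and

$$\hat{v}_{t,i} = \left( \frac{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \ge \hat{\gamma}_t\}} W(\mathbf{X}_k; \mathbf{u}, \hat{\mathbf{v}}_{t-1}) X_{ki}^{a_i}}{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \ge \hat{\gamma}_t\}} W(\mathbf{X}_k; \mathbf{u}, \hat{\mathbf{v}}_{t-1})} \right)^{1/a_i}, \tag{3.59}$$

respectively.

A different parameterization of the Weibull distribution gives an even
simpler formula. Namely, if we use the change of measure

$$X_i \sim \mathrm{Weib}(a_i, u_i^{-1/a_i}) \longrightarrow \mathrm{Weib}(a_i, v_i^{-1/a_i}),$$

thus,

$$f(x; v_i) = a_i v_i^{-1} x^{a_i - 1} e^{-x^{a_i}/v_i},$$

then the $v$-parameters are updated as

$$\hat{v}_{t,i} = \frac{\sum_{k=1}^{N} I_k W_k X_{ki}^{a_i}}{\sum_{k=1}^{N} I_k W_k}, \tag{3.60}$$

where we have used the abbreviations $I_k = I_{\{S(\mathbf{X}_k) \ge \hat{\gamma}_t\}}$ and
$W_k = W(\mathbf{X}_k; \mathbf{u}, \hat{\mathbf{v}}_{t-1})$.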



Remark 3.16 (Two-parameter update). For the Weibull distribution it is not
difficult to formulate a two-parameter updating procedure in which both scale
and shape parameter are updated. Specifically, consider the change of measure

$$X_i \sim \mathrm{Weib}(a_i, u_i^{-1/a_i}) \longrightarrow \mathrm{Weib}(b_i, v_i^{-1/b_i}), \quad b_i > 0, \; v_i > 0.$$

The updating formula for the $v_i$ is given in (3.60) (with $a_i$ replaced by $b_i$),
but an analytic updating of the parameter vector $\mathbf{b} = (b_1, \ldots, b_n)$ is not
available from (3.41). However, it is readily seen that the $i$-th component of
$\nabla_{\mathbf{b}} \ln f(\mathbf{X}; \mathbf{b}, \mathbf{v})$ for the random vector $\mathbf{X}$ with independent components
$X_i \sim \mathrm{Weib}(b_i, v_i^{-1/b_i})$ equals

$$b_i^{-1} + \ln X_i - \frac{X_i^{b_i}}{v_i} \ln X_i. \tag{3.61}$$

Consequently, the $i$-th component of $\mathbf{b}$ can be obtained from the numerical
solution of the following nonlinear equation:

$$\frac{1}{N} \sum_{k=1}^{N} I_k W_k \left( b_i^{-1} + \ln X_{ki} - \frac{X_{ki}^{b_i}}{v_i} \ln X_{ki} \right) = 0. \tag{3.62}$$

Substituting $v_i$ from (3.60) into (3.62) we obtain

$$b_i^{-1} + \frac{\sum_{k=1}^{N} I_k W_k \ln X_{ki}}{\sum_{k=1}^{N} I_k W_k} - \frac{\sum_{k=1}^{N} I_k W_k X_{ki}^{b_i} \ln X_{ki}}{\sum_{k=1}^{N} I_k W_k X_{ki}^{b_i}} = 0. \tag{3.63}$$

One might solve (3.63) for example using the bisection method.
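A bisection solver for the shape equation (3.63) might look as follows. This is a minimal sketch under stated assumptions: `ws[k]` stands for the product $I_k W_k$, all samples are strictly positive, and the left-hand side of (3.63) is decreasing in $b_i$ (as it is for the ordinary Weibull likelihood equation). The function names and bracket defaults are hypothetical.

```python
import math
import random

def shape_equation(b, xs, ws):
    """Left-hand side of (3.63) for one component; ws[k] plays the role of I_k * W_k."""
    sw = sum(ws)
    swl = sum(w * math.log(x) for x, w in zip(xs, ws))
    swx = sum(w * x ** b for x, w in zip(xs, ws))
    swxl = sum(w * x ** b * math.log(x) for x, w in zip(xs, ws))
    return 1.0 / b + swl / sw - swxl / swx

def solve_shape(xs, ws, lo=1e-3, hi=1.0, tol=1e-10):
    """Bisection for the root of shape_equation, assuming it is positive at lo."""
    while shape_equation(hi, xs, ws) > 0.0:   # expand the bracket until the sign changes
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if shape_equation(mid, xs, ws) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Sanity check with unit weights: (3.63) then reduces to the Weibull shape MLE equation.
rng = random.Random(0)
sample = [rng.weibullvariate(1.5, 2.0) for _ in range(4000)]  # scale 1.5, shape 2.0
print(solve_shape(sample, [1.0] * len(sample)))               # close to the true shape 2.0
```

Once $b_i$ is found, the scale update (3.60) is evaluated with exponent $b_i$.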



Remark 3.17 (Two-parameter families). As noted in Remark 3.16 it is
sometimes useful to consider two-parameter families in which both parameters
have to be updated. Consider the univariate case where $X$ is a random
variable with a density $f(\cdot; u)$. We wish to estimate $\ell = P(S(X) \ge \gamma)$, via IS,
using a member of the two-parameter family $\{g(\cdot; \boldsymbol{\eta})\}$, where $\boldsymbol{\eta} = (\eta_1, \eta_2)$.
The optimal CE parameters follow, in the usual way, from minimization of
the Kullback-Leibler distance to the theoretically optimal IS density for this
problem. Specifically, the optimal parameter vector $\boldsymbol{\eta}^*$ is the solution to the
maximization problem

$$\max_{\boldsymbol{\eta}} D(\boldsymbol{\eta}) = \max_{\boldsymbol{\eta}} E_{\boldsymbol{\theta}} I_{\{S(X) \ge \gamma\}} W(X; u, \boldsymbol{\theta}) \ln g(X; \boldsymbol{\eta}), \tag{3.64}$$

where $X \sim g(\cdot; \boldsymbol{\theta})$ and $W(X; u, \boldsymbol{\theta}) = f(X; u)/g(X; \boldsymbol{\theta})$. As usual it can be
estimated by solving the stochastic optimization program

$$\max_{\boldsymbol{\eta}} \hat{D}(\boldsymbol{\eta}) = \max_{\boldsymbol{\eta}} \frac{1}{N} \sum_{i=1}^{N} I_{\{S(X_i) \ge \gamma\}} W(X_i; u, \boldsymbol{\theta}) \ln g(X_i; \boldsymbol{\eta}), \tag{3.65}$$

where $X_1, \ldots, X_N$ is a random sample from $g(\cdot; \boldsymbol{\theta})$. This is very similar to
the "standard" CE procedure (3.40) and its stochastic equivalent (3.41).
Consequently, the main CE Algorithm 3.4.1 and its modifications can be easily
adapted to the deployment of the family $\{g(\cdot; \boldsymbol{\eta})\}$ instead of $\{f(\cdot; \mathbf{v})\}$.

We give two examples where the updating formulas are easy, because (as
noted in Remark 2.8) solving (3.65) is very similar to deriving the maximum
likelihood estimator of $\boldsymbol{\eta}$ on the basis of the random sample $X_1, \ldots, X_N$. The
only difference is the presence of the "weighting terms" $I_i W_i$ (using
abbreviations similar to those in (3.60)).







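Normal Distribution

Consider the density

$$g(x; \boldsymbol{\eta}) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad x \in \mathbb{R}, \quad \boldsymbol{\eta} = (\mu, \sigma^2).$$

The optimal solution of (3.65) follows from minimization of

$$\frac{1}{\sigma^2} \sum_{i=1}^{N} I_i W_i (X_i - \mu)^2 + \ln(\sigma^2) \sum_{i=1}^{N} I_i W_i.$$

It is easily seen that this minimum is obtained at $(\hat{\mu}, \hat{\sigma}^2)$ given by

$$\hat{\mu} = \frac{\sum_{i=1}^{N} I_i W_i X_i}{\sum_{i=1}^{N} I_i W_i}$$

and

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{N} I_i W_i (X_i - \hat{\mu})^2}{\sum_{i=1}^{N} I_i W_i}.$$

Shifted Exponential Distribution

Consider the density

$$g(x; \boldsymbol{\eta}) = \eta \, e^{-\eta (x - a)}, \quad x \ge a, \quad \boldsymbol{\eta} = (\eta, a).$$

Denote the CE optimal parameters by $(\hat{\eta}, \hat{a})$. Let $X_{(1)}, \ldots, X_{(N)}$ denote the
order statistics of $X_1, \ldots, X_N$, and let $I_{(i)}$ be the indicator in $\{I_1, \ldots, I_N\}$
that corresponds to $X_{(i)}$. Let $K$ be the first index such that $I_{(K)} = 1$. Note
that $a$ must be less than or equal to the order statistic $X_{(K)}$, since otherwise
$g(X_i; (\eta, a)) = 0$ for some $i$ with $I_i = 1$. Thus, we have to maximize

$$\sum_{i} I_i W_i \ln \eta - \eta \sum_{i} I_i W_i (X_i - a),$$

for $a \le X_{(K)}$ and $\eta \ge 0$. It follows that

$$\hat{a} = X_{(K)} \tag{3.66}$$

and

$$\hat{\eta} = \frac{\sum_{i} I_i W_i}{\sum_{i} I_i W_i (X_i - \hat{a})}. \tag{3.67}$$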
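The weighted maximum-likelihood updates for the normal and the shifted exponential families can be written in a few lines. The sketch below takes the samples, the indicators $I_i$, and the likelihood ratios $W_i$ as plain lists; the function names are hypothetical, and the shifted exponential update assumes the shift is set to the smallest sample with $I_i = 1$.

```python
def normal_update(xs, ind, w):
    """Weighted mean and variance with weights I_i * W_i (normal family)."""
    iw = [i * wi for i, wi in zip(ind, w)]
    total = sum(iw)
    mu = sum(c * x for c, x in zip(iw, xs)) / total
    var = sum(c * (x - mu) ** 2 for c, x in zip(iw, xs)) / total
    return mu, var

def shifted_exp_update(xs, ind, w):
    """Rate and shift (eta, a): a is the smallest elite sample, eta a weighted rate."""
    a = min(x for x, i in zip(xs, ind) if i)
    num = sum(i * wi for i, wi in zip(ind, w))
    den = sum(i * wi * (x - a) for x, i, wi in zip(xs, ind, w))
    return num / den, a

xs, ind, w = [1.0, 2.0, 3.0, 4.0], [0, 1, 1, 1], [1.0, 1.0, 1.0, 2.0]
print(normal_update(xs, ind, w))        # (3.25, 0.6875)
print(shifted_exp_update(xs, ind, w))   # (0.8, 2.0)
```

With all weights equal these reduce to the usual maximum likelihood estimators computed from the elite samples only.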







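Remark 3.18 (Hazard rate twisting). It is interesting to note that hazard rate
twisting [91] often amounts to SLR. In hazard rate twisting the change of
measure for some distribution with pdf $f$ (with support in $\mathbb{R}_+$) and tail
distribution function $\bar{F}$ is such that the hazard rate (or failure rate) $\lambda(x) = f(x)/\bar{F}(x)$
is changed to $(1 - \theta)\lambda(x)$, for some $0 \le \theta < 1$. The pdf of the changed measure
is now

$$f_\theta(x) = (1 - \theta) \lambda(x) e^{-(1 - \theta) \Lambda(x)},$$

where $\Lambda(x) = \int_0^x \lambda(y) \, dy$. In particular, for the $\mathrm{Weib}(a, u^{-1})$ distribution we
have $\lambda(x) = a u^{-1} (u^{-1} x)^{a-1}$ and $\Lambda(x) = (u^{-1} x)^a$, so that

$$f_\theta(x) = (1 - \theta) a u^{-1} (u^{-1} x)^{a-1} e^{-(1 - \theta)(u^{-1} x)^a},$$

which corresponds to the SLR CoM $\mathrm{Weib}(a, u^{-1}) \longrightarrow \mathrm{Weib}(a, v^{-1})$ with
$v^{-1} = u^{-1} (1 - \theta)^{1/a}$. Similarly, for the $\mathrm{Pareto}(a, u^{-1})$ distribution, with
$\bar{F}(x) = (1 + u^{-1} x)^{-a}$, we have $\lambda(x) = a u^{-1} (1 + u^{-1} x)^{-1}$ and
$\Lambda(x) = a \ln(1 + u^{-1} x)$, so that

$$f_\theta(x) = (1 - \theta) a u^{-1} (1 + u^{-1} x)^{-(1 - \theta) a - 1},$$

so that hazard rate twisting with parameter $\theta$ corresponds to the SLR change
of measure $\mathrm{Pareto}(a, u^{-1}) \longrightarrow \mathrm{Pareto}(b, u^{-1})$ with $b = (1 - \theta) a$. Note that in
the Weibull case the scale parameter $u^{-1}$ is changed, whereas in the second
case the shape parameter $a$ is changed.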



3.6 Convergence Issues 

In this section we consider the convergence behavior of Algorithm 3.4.2. In
particular, we investigate the conditions under which the algorithm indeed
terminates. For this to happen, the sequence $\{\gamma_t\}$ has to cross level $\gamma$ at some
stage $T < \infty$, at which point $\gamma_T$ is reset to $\gamma$ and $\mathbf{v}_T$ is determined from
(3.43), from which it follows that

$$\mathbf{v}_T = \operatorname*{argmax}_{\mathbf{v}} E_{\mathbf{u}} I_{\{S(\mathbf{X}) \ge \gamma\}} \ln f(\mathbf{X}; \mathbf{v}) = \mathbf{v}^*.$$

The next proposition gives a sufficient condition for convergence in the
one-dimensional case. Recall that we wish to estimate $\ell = P(S(X) \ge \gamma)$, where
$X \in \mathbb{R}$ has a pdf $f(\cdot; u)$ belonging to a family $\{f(\cdot; v)\}$. Let $F(\cdot; v)$ be the cdf
corresponding to the pdf $f(\cdot; v)$.

Proposition 3.19 (Convergence). Let $\{f(\cdot; v)\}$ be a NEF that is
parameterized by the mean $v$, and let $S$ be an unbounded monotone increasing
continuous function on $\mathbb{R}$. Suppose $F(x; v)$ is monotone nonincreasing in $v$. If
there exists $c > 0$ such that

$$F(v + c; v) \le F(u + c; u) \quad \text{for all } v > u, \tag{3.68}$$

then any $\varrho$ satisfying

$$\varrho < 1 - F(u + c; u) \tag{3.69}$$

ensures that Algorithm 3.4.2 terminates in at most $\gamma/c$ iterations.

Proof. First, observe that

$$v_1 = \frac{E_u X I_{\{S(X) \ge \gamma_1\}}}{E_u I_{\{S(X) \ge \gamma_1\}}} > E_u X = u = v_0.$$

This inequality follows from the fact that $E_u X I_{\{S(X) \ge \gamma_1\}} - E_u X \, E_u I_{\{S(X) \ge \gamma_1\}}$ is
the covariance between $X$ and $I_{\{S(X) \ge \gamma_1\}}$, which must be positive since $S$ is
increasing.

Second, let $\Delta v_t = v_t - v_{t-1}$ be the increment of $v$ at iteration $t$, and
similarly let $\Delta\gamma_t$ be the increment in $\gamma$. Suppose that for some $t \ge 1$ we have
$\Delta v_t > 0$. Then, $\Delta\gamma_{t+1} > 0$ as well. To see this, recall that
$P_{v_{t-1}}(S(X) \ge \gamma_t) = \varrho$, so that

$$F(x_t; v_{t-1}) = F(x_{t+1}; v_t),$$

where $x_t = S^{-1}(\gamma_t)$, $t \ge 1$. Let us write this as

$$F(x_t; v_{t-1}) = F(x_t + \Delta x_{t+1}; v_{t-1} + \Delta v_t), \tag{3.70}$$

where $\Delta x_{t+1} = x_{t+1} - x_t$. Since $F(x; v)$ is monotone nondecreasing in $x$,
monotone nonincreasing in $v$, and $\Delta v_t > 0$, we must have that $\Delta x_{t+1} > 0$,
which implies $\Delta\gamma_{t+1} > 0$.

Third, suppose, conversely, that $\Delta\gamma_t > 0$ for some $t \ge 1$. Then, also
$\Delta v_t > 0$. This is proved as follows: Define

$$\psi(\gamma) = \frac{E_u X I_{\{S(X) \ge \gamma\}}}{E_u I_{\{S(X) \ge \gamma\}}} = \frac{\int_{S^{-1}(\gamma)}^{\infty} x f(x; u) \, dx}{\int_{S^{-1}(\gamma)}^{\infty} f(x; u) \, dx}.$$

We leave it as an exercise to show that $\psi$ has a strictly positive derivative.
Since $v_t = \psi(\gamma_t)$, it follows that $\gamma_t > \gamma_{t-1}$ implies $v_t > v_{t-1}$.

Finally, we show that Algorithm 3.4.2 actually reaches level $\gamma$ in a finite
number of iterations. A sufficient condition is that $\{\gamma_t\}$ increases to $\infty$, as
$t \to \infty$. This in turn is equivalent to $\{v_t\}$ increasing to $\infty$, because
$\psi(\gamma) \ge S^{-1}(\gamma)$ and $S^{-1}$ is a continuous, monotone increasing and unbounded
function. The condition $v_t \to \infty$, or equivalently the divergence of the series
$\sum \Delta v_t$, depends on the choice of $\varrho$. Specifically, $\varrho$ needs to be chosen such
that for every $v > u$ the corresponding increment $\Delta v$ is greater than some fixed
constant $c > 0$, with

$$\Delta v = \frac{\int_{F^{-1}(1 - \varrho; v)}^{\infty} (x - v) f(x; u) \, dx}{\int_{F^{-1}(1 - \varrho; v)}^{\infty} f(x; u) \, dx}.$$

To find such a $\varrho$ observe that $\Delta v > F^{-1}(1 - \varrho; v) - v$, so that a sufficient
condition for divergence of $\sum \Delta v_t$ is that $F^{-1}(1 - \varrho; v) - v > c$, which is
equivalent to $\varrho < 1 - F(v + c; v)$. In particular, if (3.68) holds, then any $\varrho$
satisfying (3.69) guarantees convergence of the CE algorithm with a "step size"
of at least $c$.







Remark 3.20. If (3.68) does not hold then we may still enforce the divergence
of $\sum \Delta v_t$ by updating $\varrho$ in each iteration via

$$\varrho_t \le 1 - F(v_t + c; v_t).$$



Example 3.21 (Example 3.12 continued). Recall that in that example we had
$S(\mathbf{X}) = \min(X_1, \ldots, X_n)$ with $X_i \sim f(x; u) = u^{-1} e^{-x u^{-1}}$. As seen earlier,
the CE-optimal solution is $v^* = \gamma + u$ (see Example 3.4). Consider now $\gamma_t$
and $v_t$ defined in (3.42) and (3.43). Since the algorithm stops when $\gamma_t \ge \gamma$,
and since the distribution of $S(\mathbf{X})$ is continuous, we can write

$$\gamma_t = \max\{x \le \gamma : \exp(-x n / v_{t-1}) \ge \varrho\} = \min\{\gamma, C v_{t-1}/n\}, \tag{3.71}$$

where $C = -\ln \varrho > 0$. The parameter $v_t$ defined in (3.43) can then be rewritten as

$$v_t = \gamma_t + u = \min\{\gamma, C v_{t-1}/n\} + u. \tag{3.72}$$

Consider the unidimensional function $g(v) = \min\{\gamma, C v/n\} + u$. It is easy to
see that $g$ has a single fixed point $\bar{v}$. Figure 3.2 illustrates the convergence of
the procedure.

Fig. 3.2. Convergence of Algorithm 3.4.2.

It is easy to see that $\bar{v} = v^*$ if and only if

$$\gamma \le (C/n) \bar{v}. \tag{3.73}$$

Let us compute $\bar{v}$. If $C/n \ge 1$, that is, if $C \ge n$, then we have $\bar{v} = \gamma + u = v^*$.
On the other hand, if $C/n < 1$, that is, if $C < n$, then we have two cases:

1. $\gamma \le (C/n)(\gamma + u)$: in this case, we have that $g(\gamma + u) = \gamma + u$, that is,
   $\bar{v} = \gamma + u$.

2. $\gamma > (C/n)(\gamma + u)$: in this case, we have that $\bar{v} = u/(1 - C/n)$.

In both cases it is easy to check that the optimality condition (3.73) becomes
$\gamma \le (C/n)(\gamma + u)$, that is, $C \ge n\gamma/(\gamma + u)$. Since $C = -\ln \varrho$, it follows that
Algorithm 3.4.2 will converge to the correct solution if and only if

$$\varrho \le \exp\left( -\frac{n \gamma}{\gamma + u} \right). \tag{3.74}$$

Moreover, if $\varrho \le \exp(-n)$ (which implies (3.74)), that is, if $C/n \ge 1$, then the
differences $v_t - v_{t-1}$ increase until the point when $\gamma$ is hit by $\gamma_t$. On the other
hand, if $\varrho > \exp(-n)$, then the differences $v_t - v_{t-1}$ decrease until the point
when $\gamma$ is hit by $\gamma_t$.

At first sight, condition (3.74) seems discouraging, since it requires the
parameter $\varrho$ to decrease exponentially with $n$. Notice however that this
example constitutes an intrinsically difficult problem: from (3.45), we see that
the probability being estimated goes to zero exponentially in $n$ under any
parameter. It makes perhaps more sense to consider the behavior of (3.74) for
fixed $n$; then we see that $\varrho < \exp(-n)$ is a sufficient condition for Algorithm
3.4.2 to work, regardless of the value of $\gamma$. We may also consider what happens
when $\gamma$ is allowed to vary with $n$; for example, when $\gamma = \Delta/n$ for some $\Delta > 0$,
condition (3.74) becomes asymptotically $\varrho < \exp(-\Delta/u)$.
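The fixed-point behavior described in Example 3.21 can be reproduced by iterating (3.71) and (3.72) directly. The sketch below uses hypothetical function names and an arbitrary iteration cap; it returns the final $v_t$ together with the iteration at which level $\gamma$ was hit, or `None` if the sequence stalls below $\gamma$.

```python
import math

def iterate_levels(gamma, u, n, rho, max_iter=10_000):
    """Iterate gamma_t = min(gamma, C v_{t-1} / n), v_t = gamma_t + u with C = -ln(rho)."""
    c = -math.log(rho)
    v = u                                   # v_0 = u
    for t in range(1, max_iter + 1):
        g = min(gamma, c * v / n)
        v = g + u
        if g == gamma:                      # level gamma reached: v equals v* = gamma + u
            return v, t
    return v, None                          # stalled at the fixed point u / (1 - C/n)

def condition_3_74(gamma, u, n, rho):
    """Convergence condition rho <= exp(-n * gamma / (gamma + u))."""
    return rho <= math.exp(-n * gamma / (gamma + u))

for rho in (0.1, 0.15, 0.5):
    print(rho, condition_3_74(10.0, 1.0, 2, rho), iterate_levels(10.0, 1.0, 2, rho))
```

For $\gamma = 10$, $u = 1$, $n = 2$ the threshold in (3.74) is about 0.162, so the first two values of $\varrho$ reach $v^* = 11$ while $\varrho = 0.5$ stalls near $u/(1 - C/n) \approx 1.53$.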

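Example 3.22 (Example 3.13 continued). Recall that in that example we had
$S(\mathbf{X}) = X_1 + \cdots + X_n$ with $X_i \sim f(x; u) = u^{-1} e^{-x u^{-1}}$. As seen earlier,
the CE-optimal solution is given by (3.50). To study the convergence of the
deterministic Algorithm 3.4.2, we define

$$R_n(x) = \sum_{k=0}^{n-1} \frac{e^{-x} x^k}{k!}$$

and let $R_n^{-1}$ denote its inverse on $(0, 1)$. To find $\gamma_t$ for $t = 1, 2, \ldots$ we have to
solve $\gamma_t$ from

$$R_n(\gamma_t / v_{t-1}) = \varrho,$$

and check whether this $\gamma_t \ge \gamma$, in which case we "reset" $\gamma_t$ to $\gamma$. In other
words,

$$\gamma_t = \min\{\gamma, v_{t-1} R_n^{-1}(\varrho)\}.$$

Moreover, for large $\gamma_t$ we saw in (3.51) that $v_t \approx (u + \gamma_t)/n$. Thus,

$$v_t \approx \min\left\{ \frac{u + \gamma}{n}, \frac{u + v_{t-1} R_n^{-1}(\varrho)}{n} \right\}.$$

Consequently, we are in the same situation as in the previous example. In
particular, if we define

$$g(v) = \min\left\{ \frac{\gamma + u}{n}, \frac{v R_n^{-1}(\varrho) + u}{n} \right\},$$

then $v_1, v_2, \ldots$ converges to the fixed point $(\gamma + u)/n$ of $g$ if and only if

$$\frac{u}{n - R_n^{-1}(\varrho)} \ge \frac{u + \gamma}{n},$$

which is equivalent to

$$\varrho \le R_n\left( \frac{n \gamma}{u + \gamma} \right).$$

Note that as $\gamma$ grows larger, $R_n(n\gamma/(u + \gamma))$ tends to $R_n(n)$, which is close
to 1/2 for large $n$, showing that even a $\varrho$ of nearly 1/2 guarantees convergence.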



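Example 3.23 (Example 3.7 continued). Consider the deterministic Algorithm
3.4.2 for the discrete uniform $m$-point density, given by

$$f(x) = \frac{1}{m}, \quad x = 1, \ldots, m.$$

It is readily seen that in this case (3.35) becomes

$$v_i^* = \begin{cases} 0, & \text{if } 1 \le i \le \gamma - 1, \\ \dfrac{1}{m - \gamma + 1}, & \text{if } \gamma \le i \le m, \end{cases} \tag{3.75}$$

assuming again that $\gamma \in \{1, \ldots, m\}$. Now, consider the sequence of pairs
$\{(\gamma_t, \mathbf{v}_t), t = 1, 2, \ldots\}$ for the multilevel approach of Algorithm 3.4.2. First,
for a given $\mathbf{v}_t$, $t = 1, 2, \ldots$ (where $\mathbf{v}_0 = \mathbf{u}$, by definition), we have

$$\gamma_{t+1} = \max\{x : P_{\mathbf{v}_t}(X \ge x) \ge \varrho\}.$$

From (3.75) we see that $\mathbf{v}_t = (v_{t1}, \ldots, v_{tm})$ satisfies

$$v_{ti} = \begin{cases} 0, & \text{if } 1 \le i \le \gamma_t - 1, \\ \dfrac{1}{m - \gamma_t + 1}, & \text{if } \gamma_t \le i \le m, \end{cases}$$

for a given $\gamma_t$. In particular, we have for $t = 1, 2, 3, \ldots$ and $\gamma_t \le x \le m$,

$$P_{\mathbf{v}_t}(X \ge x) = \frac{m - \lceil x \rceil + 1}{m - \gamma_t + 1}.$$

It follows that for $t = 1, 2, \ldots$,

$$\gamma_{t+1} = \lfloor (m + 1)(1 - \varrho) + \varrho \gamma_t \rfloor,$$

and, similarly, $\gamma_1 = \lfloor 1 + m(1 - \varrho) \rfloor$.

Due to the rounding operation, it is difficult to compute a closed form
solution for $\gamma_t$. We can however use the bounds

$$(m + 1)(1 - \varrho) + \varrho \gamma_t - 1 < \gamma_{t+1} \le (m + 1)(1 - \varrho) + \varrho \gamma_t.$$

By applying these bounds recursively, we obtain that

$$\gamma_{t+1} \ge \left( (m + 1)(1 - \varrho) - 1 \right) (1 + \varrho + \cdots + \varrho^t),$$

$$\gamma_{t+1} \le (m + 1)(1 - \varrho)(1 + \varrho + \cdots + \varrho^t) + \varrho^{t+1},$$

that is,

$$\left( m + 1 - \frac{1}{1 - \varrho} \right) (1 - \varrho^{t+1}) \le \gamma_{t+1} \le (m + 1)(1 - \varrho^{t+1}) + \varrho^{t+1}.$$

We can infer from the above inequality that, if $m + 1 - \frac{1}{1 - \varrho} > \gamma - 1$, that is,
if $\varrho < (m - \gamma + 1)/(m - \gamma + 2)$, then for $t$ large enough we have $\gamma_{t+1} \ge \gamma$.
Therefore, at some stage $T$ of Algorithm 3.4.2, we obtain that $\gamma_T = \gamma$ and
$\mathbf{v}_T = \mathbf{v}^*$. In other words, the algorithm finishes after a finite number of steps
provided

$$\varrho < \frac{m - \gamma + 1}{m - \gamma + 2}. \tag{3.76}$$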
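The level recursion for the discrete uniform case is easy to run exactly. The sketch below (hypothetical function name) uses `fractions.Fraction` so that the floor operation is not affected by floating-point rounding; it stops when level $\gamma$ is crossed (and resets the last level to $\gamma$) or when the sequence stalls.

```python
from fractions import Fraction
from math import floor

def level_sequence(m, gamma, rho, max_iter=1000):
    """Levels gamma_1, gamma_2, ... for the discrete uniform example; rho should be a Fraction."""
    rho = Fraction(rho)
    levels = [floor(1 + m * (1 - rho))]                     # gamma_1
    while levels[-1] < gamma and len(levels) < max_iter:
        nxt = floor((m + 1) * (1 - rho) + rho * levels[-1]) # gamma_{t+1}
        if nxt == levels[-1]:
            break                                           # stalled below gamma
        levels.append(nxt)
    levels[-1] = min(levels[-1], gamma)                     # reset to gamma once crossed
    return levels

m, gamma = 100, 95
bound = Fraction(m - gamma + 1, m - gamma + 2)              # right-hand side of (3.76)
print(level_sequence(m, gamma, Fraction(1, 2)), Fraction(1, 2) < bound)
# prints: [51, 76, 88, 94, 95] True
print(level_sequence(m, gamma, Fraction(9, 10))[-1], Fraction(9, 10) < bound)
# prints: 92 False
```

In the second call condition (3.76) fails and the levels stall at 92, below the target level 95.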



The examples above illustrate that the parameter $\varrho$ used in the CE
algorithm plays a crucial role: we can only expect the CE algorithm to converge
to the correct values if $\varrho$ is sufficiently small. To determine, however, a priori
a desired $\varrho$ can be a difficult task.

A way to overcome this problem is to modify the basic CE algorithm in
such a way that either $\varrho$ or $N$ or both are changed adaptively. We will consider
such a modification, called the fully adaptive cross entropy (FACE) method,
in more detail in Chapter 5; see also [145] for related ideas. However, for the
rest of this section we briefly consider a modification proposed in [85], a
paper that provides various proofs of convergence for the CE algorithm.

Let $\mathbf{v}^*$ be a CE-optimal solution, that is,

$$\mathbf{v}^* \in \mathcal{V}^* = \operatorname*{argmax}_{\mathbf{v}} E_{\mathbf{u}} I_{\{S(\mathbf{X}) \ge \gamma\}} \ln f(\mathbf{X}; \mathbf{v}), \tag{3.77}$$

where we have not assumed that the maximizer is unique: the argmax set
$\mathcal{V}^*$ may contain more than one element.

The basic assumption in [85] is the following:

Assumption A: There exists a set $\mathcal{U}$ such that $\mathcal{U} \cap \mathcal{V}^* \ne \emptyset$ and
$P_{\mathbf{v}}(S(\mathbf{X}) \ge \gamma) > 0$ for all $\mathbf{v} \in \mathcal{U}$.

Assumption A simply ensures that the probability being estimated,
$\ell = P(S(\mathbf{X}) \ge \gamma)$, does not vanish in a neighborhood of the optimal parameter
$\mathbf{v}^*$. The assumption is trivially satisfied when the distribution of $S(\mathbf{X})$ has
infinite tail. For finite support distributions the assumption holds as long as
either $\gamma$ is less than the maximum value of the function $S(\mathbf{x})$, or if there is a
positive probability that $\gamma$ is attained.

We present below from [85] the modified version of Algorithm 3.4.1. The
algorithm requires the definition of constants $\varrho$ (typically, $0.01 \le \varrho \le 0.1$),
$\beta > 1$ and $\delta > 0$.

Algorithm 3.6.1 (An Adaptive CE Algorithm)

1. Define $\varrho_0 = \varrho$, $\hat{\mathbf{v}}_0 = \mathbf{u}$, $N_0 = N$ (original sample size). Set $t = 1$.

2. Generate a random sample $\mathbf{X}_1, \ldots, \mathbf{X}_{N_{t-1}}$ from $f(\cdot; \hat{\mathbf{v}}_{t-1})$. Let $\hat{\gamma}_t$ be the
   sample $(1 - \varrho_{t-1})$-quantile of performances $S(\mathbf{X}_1), \ldots, S(\mathbf{X}_{N_{t-1}})$, provided
   this is less than $\gamma$. Otherwise put $\hat{\gamma}_t = \gamma$.

3. Use the same sample $\mathbf{X}_1, \ldots, \mathbf{X}_{N_{t-1}}$ to solve the stochastic program
   (3.41), with $N = N_{t-1}$. Denote the solution by $\hat{\mathbf{v}}_t$.

4. If $\hat{\gamma}_t = \gamma$ then proceed with Step 5. Otherwise, check whether there exists
   a $\bar{\varrho} \le \varrho$ such that the sample $(1 - \bar{\varrho})$-quantile of $S(\mathbf{X}_1), \ldots, S(\mathbf{X}_{N_{t-1}})$ is
   bigger than or equal to $\min\{\gamma, \hat{\gamma}_{t-1} + \delta\}$:

   a) If so, choose $\varrho_t$ the largest such $\bar{\varrho}$ and let $N_t = N_{t-1}$;

   b) Otherwise set $\varrho_t = \varrho_{t-1}$ and $N_t = \beta N_{t-1}$.

   Set $t = t + 1$ and reiterate from Step 2.

5. Estimate the rare-event probability $\ell$ using the estimator (3.36), with $\mathbf{v}$
   replaced by $\hat{\mathbf{v}}_T$, where $T$ is the final iteration counter.

It is proved in [85] that, under Assumption A, Algorithm 3.6.1 converges
with probability 1 to a solution of (3.24) (with $H(\mathbf{X}_i) = I_{\{S(\mathbf{X}_i) \ge \gamma\}}$ and
$N = N_T$) after a finite number of iterations $T$.

We can then compare the approximating solution $\hat{\mathbf{v}}_T$ and the "true"
solution $\mathbf{v}^*$ using the asymptotic analysis for optimal solutions of stochastic
optimization problems discussed in [149]. In particular, we obtain the
consistency result that, as $N \to \infty$, the distance between $\hat{\mathbf{v}}_T$ and the solution
set $\mathcal{V}^*$ defined in (3.77) goes to zero (with probability 1) provided that: i) the
function $\mathbf{v} \mapsto \ln f(\mathbf{x}; \mathbf{v})$ is continuous, ii) the set $\mathcal{U}$ defined in Assumption A
is compact, and iii) there exists a function $h$ such that $E_{\mathbf{u}} h(\mathbf{X}) < \infty$ and
$|\ln f(\mathbf{x}; \mathbf{v})| \le h(\mathbf{x})$ for all $\mathbf{x}$ and all $\mathbf{v}$.

Remark 3.24. Instead of using a constant $\delta$ in Algorithm 3.6.1 one can also
apply a dynamic procedure where $\delta$ is changed according to the differences
in $\hat{\gamma}_t$. More specifically, we can set $\delta_t = \hat{\gamma}_t - \hat{\gamma}_{t-1}$ between Steps 3 and 4 of
the algorithm, and use $\delta_t$ in Step 4. The idea behind that is to update $\varrho_t$ as
soon as the differences in the $\hat{\gamma}$'s start decreasing. Such a procedure does not
affect theoretical convergence and typically makes $\hat{\gamma}_t$ reach $\gamma$ faster; however,
in some instances this might cause $\hat{\gamma}_t$ to increase too fast, which in turn will
yield a poor estimate $\hat{\mathbf{v}}_t$ in Step 3 of the algorithm, due to the (relative) rarity
of the event $\{S(\mathbf{X}) \ge \hat{\gamma}_{t-1}\}$ in (3.41). Thus, the more conservative choice
$\delta$ = constant is recommended unless some pilot studies can be performed.

An even more conservative approach is to take $\delta = 0$ until the sequence
$\{\hat{\gamma}_t\}$ has "stalled," at which point a positive $\delta$ is used again. This approach
yields the slowest progression of $\hat{\gamma}_t$, but in turn the estimate $\hat{\mathbf{v}}_T$ is more
reliable. Notice however that, even if the optimal $\mathbf{v}^*$ could be obtained, some
problems might still require a very large sample size in the final estimation
Step 5; see the discussion in Section 3.5. Given the limitations of one's
computational budget, Algorithm 3.6.1 can be used to detect such a situation:
the algorithm can be halted once $\varrho_t$ in Step 4 gets too small (or, equivalently,
when $N_t$ gets too large).



3.7 The Root-Finding Problem 

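We briefly discuss now an application of the CE method to root finding. In
many practical situations we need to estimate, for given $\ell$, the root $\gamma$ of the
nonlinear equation

$$P_{\mathbf{u}}(S(\mathbf{X}) \ge \gamma) = E_{\mathbf{u}} I_{\{S(\mathbf{X}) \ge \gamma\}} = \ell \tag{3.78}$$

rather than estimate $\ell$ itself. An estimate of $\gamma$ in (3.78) based on the sample
equivalent of $E_{\mathbf{u}} I_{\{S(\mathbf{X}) \ge \gamma\}}$ can be obtained, for example, via stochastic
approximation [148].

Alternatively, one can obtain $\gamma$ using the CE method. In particular,
Algorithm 3.4.1 can be modified as follows:

Algorithm 3.7.1 (Root-Finding Algorithm)

1. Define $\hat{\mathbf{v}}_0 = \mathbf{u}$. Set $t = 1$ (iteration = level counter).

2. Generate a sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ from the density $f(\cdot; \hat{\mathbf{v}}_{t-1})$ and compute
   the sample $(1 - \varrho)$-quantile $\hat{\gamma}_t$ of the performances according to (3.39).
   Calculate

   $$\hat{\ell}_t = \frac{1}{N} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \ge \hat{\gamma}_t\}} W(\mathbf{X}_i; \mathbf{u}, \hat{\mathbf{v}}_{t-1}),$$

   provided this is greater than $\ell$; otherwise set $\hat{\ell}_t = \ell$.

3. Use the same sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ to solve the stochastic program (3.41).
   Denote the solution by $\hat{\mathbf{v}}_t$.

4. If $\hat{\ell}_t = \ell$ proceed to Step 5; otherwise let $t = t + 1$ and reiterate from Step 2.

5. Let $T$ be the final iteration number. Generate a sample $\mathbf{X}_1, \ldots, \mathbf{X}_{N_1}$ from
   the density $f(\cdot; \hat{\mathbf{v}}_T)$ and take as estimator of $\gamma$ the smallest number $\hat{\gamma}$ such
   that

   $$\frac{1}{N_1} \sum_{i=1}^{N_1} I_{\{S(\mathbf{X}_i) \ge \hat{\gamma}\}} W(\mathbf{X}_i; \mathbf{u}, \hat{\mathbf{v}}_T) \le \ell.$$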
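A toy version of the root-finding algorithm can be sketched for $X \sim \mathrm{Exp}$ with mean $u$ and $S(x) = x$, where the exact root is $\gamma = -u \ln \ell$. The code below is an illustration under these assumptions only: the function name, sample sizes and seed are hypothetical, and in Step 5 the estimator is approximated by the smallest sample point whose weighted tail sum does not exceed $\ell$.

```python
import math
import random

def ce_root_exp(ell, u=1.0, rho=0.1, n_pilot=2000, n_final=50000, seed=2):
    """Sketch of Algorithm 3.7.1 for X ~ Exp(mean u): find gamma with P(X >= gamma) = ell."""
    rng = random.Random(seed)

    def lik_ratio(x, v):
        # W(x; u, v) for exponential densities with means u and v
        return (v / u) * math.exp(-x * (1.0 / u - 1.0 / v))

    v = u
    while True:
        xs = sorted(rng.expovariate(1.0 / v) for _ in range(n_pilot))
        g_t = xs[math.ceil((1 - rho) * n_pilot) - 1]         # sample quantile, cf. (3.39)
        elite = [x for x in xs if x >= g_t]
        w = [lik_ratio(x, v) for x in elite]
        ell_t = sum(w) / n_pilot                              # Step 2: estimate of P(X >= g_t)
        v = sum(wi * x for wi, x in zip(w, elite)) / sum(w)   # Step 3: solve (3.41)
        if ell_t <= ell:                                      # Step 4: target probability passed
            break

    # Step 5: walk down from the largest sample until the weighted tail sum exceeds ell
    xs = sorted((rng.expovariate(1.0 / v) for _ in range(n_final)), reverse=True)
    tail, gamma_hat = 0.0, xs[0]
    for x in xs:
        tail += lik_ratio(x, v) / n_final
        if tail > ell:
            break
        gamma_hat = x
    return gamma_hat

print(ce_root_exp(1e-6), -math.log(1e-6))   # estimate versus the exact root, about 13.8
```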







3.8 The TLR Method 

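In this section we present the transform likelihood ratio (TLR) method as a
simple, convenient and unifying way of constructing efficient IS estimators
that are applicable for both light- and heavy-tailed distributions.

Let $\mathbf{X}$ be a random vector. Suppose we wish to estimate

$$\ell = E I_{\{S(\mathbf{X}) \ge \gamma\}}.$$

The TLR method comprises two steps. The first is a simple change of variable
step. That is, we write $\mathbf{X}$ as a function of another random vector $\mathbf{Z}$, for
example

$$\mathbf{X} = H(\mathbf{Z}). \tag{3.79}$$

If we define

$$\tilde{S}(\mathbf{Z}) = S(H(\mathbf{Z})),$$

then

$$\ell = E I_{\{\tilde{S}(\mathbf{Z}) \ge \gamma\}}.$$

Suppose $\mathbf{Z}$ has density $h(\cdot; \boldsymbol{\theta})$ in some class of densities $\{h(\cdot; \boldsymbol{\eta})\}$. Then we
can seek to estimate $\ell$ efficiently via IS using either the SLR method (staying
in the same parametric class) or ECM. The parameter updating can again
be done via the CE method. In particular, when using the SLR method we
obtain in analogy to (3.36) the estimator

$$\hat{\ell} = \frac{1}{N} \sum_{i=1}^{N} I_{\{\tilde{S}(\mathbf{Z}_i) \ge \gamma\}} \tilde{W}(\mathbf{Z}_i; \boldsymbol{\theta}, \boldsymbol{\eta}), \tag{3.80}$$

where

$$\tilde{W}(\mathbf{Z}_i; \boldsymbol{\theta}, \boldsymbol{\eta}) = \frac{h(\mathbf{Z}_i; \boldsymbol{\theta})}{h(\mathbf{Z}_i; \boldsymbol{\eta})}$$

and $\mathbf{Z}_i \sim h(\mathbf{z}; \boldsymbol{\eta})$. We shall call the SLR estimator (3.80) based on the
transformation (3.79), the transform likelihood ratio (TLR) estimator.

To find the optimal parameter vector $\boldsymbol{\eta}^*$ of the TLR estimator (3.80) we
can solve in analogy to (3.40) the following CE program, for any reference
parameter $\boldsymbol{\tau}$:

$$\max_{\boldsymbol{\eta}} D(\boldsymbol{\eta}) = \max_{\boldsymbol{\eta}} E_{\boldsymbol{\tau}} I_{\{\tilde{S}(\mathbf{Z}) \ge \gamma\}} \tilde{W}(\mathbf{Z}; \boldsymbol{\theta}, \boldsymbol{\tau}) \ln h(\mathbf{Z}; \boldsymbol{\eta}), \tag{3.81}$$

and similarly for the stochastic counterpart of (3.81). For example, $h(\mathbf{z}; \boldsymbol{\theta})$
might be any light-tail NEF pdf (and thus, the optimal reference parameter
vector $\boldsymbol{\eta}^*$ could be obtained analytically from the stochastic counterpart
of (3.81)), or $h(\mathbf{z}; \boldsymbol{\theta})$ might be a truncated version of the original pdf
$f(\mathbf{x})$, denoted as $f(\mathbf{x}; c)$, where the truncation parameter $c$ could be
controllable as well.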

It is crucial to understand that, in contrast to the SLR estimator (3.36),
its TLR counterpart (3.80) involves an additional stage, namely the
transformation stage (3.79). As a result, the TLR estimator (3.80) is a
three-stage procedure rather than a two-stage one (see (3.36)). The three
stages of TLR are associated with

1. Transformation from the original pdf $f$ to an auxiliary one $h$.

2. Updating the parameter vector $\boldsymbol{\eta}$ (at each iteration of Algorithm 3.4.1)
using the stochastic counterpart of (3.81).

3. Estimating $\ell$ according to (3.80) with $\boldsymbol{\eta}$ replaced by $\hat{\boldsymbol{\eta}}^*$, which is
the solution obtained from Algorithm 3.4.1 at stage two.

Example 3.25 (Inverse-Transform Likelihood Ratio). Consider the
single-dimensional case. According to the inverse transform (IT) method (see
Section 1.7.1) a random variable $X \sim F$ can be written as

\[ X = F^{-1}(Z) , \tag{3.82} \]

where $Z \sim \mathsf{U}(0,1)$ and $F^{-1}$ is the inverse of the cdf $F$.

Let $h(\cdot; \nu)$ be another density on $(0,1)$ dominating the uniform density,
and parameterized by some reference parameter $\nu$. Thus, $h(z; \nu) > 0$ for all
$0 \le z \le 1$. An example is the $\mathsf{Beta}(\nu, 1)$-distribution, with density

\[ h(z; \nu) = \nu z^{\nu - 1} , \quad z \in (0,1) , \]

with $\nu > 0$, or the $\mathsf{Beta}(1, \nu)$-distribution, with density

\[ h(z; \nu) = \nu (1 - z)^{\nu - 1} , \quad z \in (0,1) . \]

The TLR estimator is given by

\[ \hat{\ell} = \frac{1}{N} \sum_{i=1}^{N} I_{\{\widetilde{S}(Z_i) \ge \gamma\}} \, \widetilde{W}(Z_i; \nu) , \tag{3.83} \]

where $Z_1, \ldots, Z_N$ is a random sample from $h(\cdot; \nu)$ and

\[ \widetilde{W}(Z; \nu) = \frac{1}{h(Z; \nu)} \tag{3.84} \]

is the LR. We call (3.83) the inverse transform likelihood ratio (ITLR)
estimator [144].

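Consider next the multivariate case where the components of $\mathbf{X} = (X_1, \ldots, X_n)$
are independent and $X_i \sim F(\cdot; u_i)$ for a fixed parameter vector
$\mathbf{u} = (u_1, \ldots, u_n)$. In analogy with the univariate case we wish to estimate, for
some performance function $S$,

\[ \ell = \mathbb{E}_f I_{\{S(\mathbf{X}) \ge \gamma\}} = \mathbb{E} I_{\{\widetilde{S}(\mathbf{Z}) \ge \gamma\}} , \]

where $\widetilde{S}(\mathbf{Z}) = S(F^{-1}(Z_1; u_1), \ldots, F^{-1}(Z_n; u_n))$, $\mathbf{Z} = (Z_1, \ldots, Z_n)$, and
$Z_j$, $j = 1, \ldots, n$, are i.i.d. and uniformly distributed on $(0,1)$.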







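Let $h(\cdot; \boldsymbol{\nu})$ be another density on $(0,1)^n$ dominating the uniform
density, and parameterized by some reference parameter vector $\boldsymbol{\nu}$. For example,
we could choose $h$ such that the $Z_i$'s are independent with a $\mathsf{Beta}(1, \nu_i)$-distribution,
in which case

\[ h(\mathbf{z}; \boldsymbol{\nu}) = \prod_{i=1}^{n} \nu_i (1 - z_i)^{\nu_i - 1} , \quad \mathbf{z} \in (0,1)^n , \tag{3.85} \]

with $\boldsymbol{\nu} = (\nu_1, \ldots, \nu_n)$. As in the univariate case we have the ITLR estimator

\[ \hat{\ell} = \frac{1}{N} \sum_{i=1}^{N} I_{\{\widetilde{S}(\mathbf{Z}_i) \ge \gamma\}} \, \widetilde{W}(\mathbf{Z}_i; \boldsymbol{\nu}) , \tag{3.86} \]

where $\mathbf{Z}_1, \ldots, \mathbf{Z}_N$ is a random sample from $h(\cdot; \boldsymbol{\nu})$ and

\[ \widetilde{W}(\mathbf{Z}; \boldsymbol{\nu}) = \frac{1}{h(\mathbf{Z}; \boldsymbol{\nu})} . \tag{3.87} \]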

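Note that Algorithm 3.4.1 remains the same for the ITLR approach, provided
the CE programs (3.40) and (3.41) are replaced by

\[ \max_{\boldsymbol{\nu}} D(\boldsymbol{\nu}) = \max_{\boldsymbol{\nu}} \mathbb{E} I_{\{\widetilde{S}(\mathbf{Z}) \ge \gamma\}} \ln h(\mathbf{Z}; \boldsymbol{\nu}) , \tag{3.88} \]

with $\mathbf{Z}$ uniformly distributed on $(0,1)^n$, and

\[ \max_{\boldsymbol{\nu}} \widehat{D}(\boldsymbol{\nu}) = \max_{\boldsymbol{\nu}} \frac{1}{N} \sum_{i=1}^{N} I_{\{\widetilde{S}(\mathbf{Z}_i) \ge \gamma\}} \, \widetilde{W}(\mathbf{Z}_i; \mathbf{w}) \ln h(\mathbf{Z}_i; \boldsymbol{\nu}) , \tag{3.89} \]

respectively, where $\mathbf{Z}_i \sim h(\cdot; \mathbf{w})$.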

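In particular, for the case (3.85) where the components $Z_j$ are independent and
$Z_j \sim \mathsf{Beta}(1, \nu_j)$, $j = 1, \ldots, n$, (3.89) can be solved analytically, and it is not
difficult to see that the components of $\boldsymbol{\nu} = (\nu_1, \ldots, \nu_n)$ are updated as

\[ \hat{\nu}_{t,j} = - \frac{\sum_{i=1}^{N} I_{\{\widetilde{S}(\mathbf{Z}_i) \ge \hat{\gamma}_t\}} \, \widetilde{W}(\mathbf{Z}_i; \hat{\boldsymbol{\nu}}_{t-1})}{\sum_{i=1}^{N} I_{\{\widetilde{S}(\mathbf{Z}_i) \ge \hat{\gamma}_t\}} \, \widetilde{W}(\mathbf{Z}_i; \hat{\boldsymbol{\nu}}_{t-1}) \ln(1 - Z_{ij})} , \quad j = 1, \ldots, n , \tag{3.90} \]

where $Z_{ij}$ is the $j$-th component of $\mathbf{Z}_i$.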
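The Beta(1, nu_j) update can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `update_beta_reference` and the convention that each weight already combines the indicator and the likelihood ratio are ours, not notation from the text.

```python
import math

def update_beta_reference(samples, weights):
    """CE update for independent Beta(1, nu_j) components, cf. (3.90).

    samples: list of vectors Z_i with entries in (0, 1)
    weights: list of I{S(Z_i) >= gamma_t} * W(Z_i; nu_{t-1}) (assumed precomputed)
    """
    n = len(samples[0])
    total = sum(weights)
    nu = []
    for j in range(n):
        # weighted sum of ln(1 - Z_ij); log1p keeps precision for small Z_ij
        weighted_log = sum(w * math.log1p(-z[j]) for z, w in zip(samples, weights))
        nu.append(-total / weighted_log)
    return nu

# Toy input: only the first sample lies in the elite set.
Z = [[0.5, 0.9], [0.8, 0.6]]
w = [1.0, 0.0]
nu = update_beta_reference(Z, w)
```

Each component is updated separately because the Beta(1, nu_j) log-density is a sum over components, so the CE program decouples across j.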

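The following example shows that (I)TLR can lead to a more efficient
estimator than the SLR method.

Example 3.26 (Example 3.12 continued). Suppose, as in Example 3.12, that
we are interested in estimating $\ell = P(S(\mathbf{X}) \ge \gamma)$, where

\[ S(\mathbf{X}) = \min(X_1, \ldots, X_n) , \quad X_i \sim \mathsf{Exp}(u^{-1}) . \tag{3.91} \]

In this case we can write

\[ X_i = -u \ln(1 - Z_i) , \quad i = 1, \ldots, n , \tag{3.92} \]

where $Z_i \sim \mathsf{U}(0,1)$, $i = 1, \ldots, n$, and $Z_1, \ldots, Z_n$ are independent. We have

\[ \widetilde{S}(\mathbf{Z}) = \min_i \left( -u \ln(1 - Z_i) \right) = -u \ln\left(1 - \min_i Z_i\right) , \]

so that

\[ \ell = P(\widetilde{S}(\mathbf{Z}) \ge \gamma) = P\left(\min_i Z_i \ge 1 - \eta\right) , \]

with $\eta = e^{-\gamma / u}$.

Let $h(\mathbf{z}; \nu) = \prod_{i=1}^{n} \nu z_i^{\nu - 1}$, $\nu > 0$, be the dominating density on $(0,1)^n$
for $\mathbf{Z}$. Note that (by symmetry) we choose all component pdfs the same, this
in contrast to (3.85). To find the optimal $\nu$ we need to solve the CE program
(3.88), which for this case reduces to

\[ \max_{\nu > 0} D(\nu) = \max_{\nu > 0} \mathbb{E} I_{\{\min_i Z_i \ge 1 - \eta\}} \sum_{i=1}^{n} \left( \ln \nu + (\nu - 1) \ln Z_i \right) . \]

Equating the gradient with respect to $\nu$ to 0 gives

\[ \nu^* = - \frac{n \, \mathbb{E} I_{\{\min_i Z_i \ge 1 - \eta\}}}{\mathbb{E} I_{\{\min_i Z_i \ge 1 - \eta\}} \sum_{i=1}^{n} \ln Z_i} = - \frac{n \eta^n}{n \eta^{n-1} \int_{1-\eta}^{1} \ln z \, dz} = \frac{\eta}{\ln(1 - \eta)(1 - \eta) + \eta} . \]

It follows that for small $\eta$ we have

\[ \nu^* \approx \frac{2}{\eta} . \tag{3.93} \]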



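To find the asymptotic SCV we need to find first the variance of the
ITLR estimator $\hat{\ell}$. Let $V(\nu)$ be the second moment of $I_{\{\widetilde{S}(\mathbf{Z}) \ge \gamma\}} \widetilde{W}(\mathbf{Z}; \nu)$.
We have, with $\mathbf{Z}$ uniformly distributed on $(0,1)^n$,

\[ V(\nu) = \mathbb{E} \, I_{\{\min_i Z_i \ge 1 - \eta\}} \frac{1}{h(\mathbf{Z}; \nu)} = \left( \int_{1-\eta}^{1} \frac{z^{1-\nu}}{\nu} \, dz \right)^{n} = \left( \frac{1 - (1 - \eta)^{2 - \nu}}{\nu (2 - \nu)} \right)^{n} . \tag{3.94} \]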







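From (3.94) and (3.93) we have for small $\eta$

\[ V(\nu^*) \approx \left( \frac{1 - (1 - \eta)^{2 - 2/\eta}}{\frac{2}{\eta} \left( 2 - \frac{2}{\eta} \right)} \right)^{n} . \]

So that, for small $\eta$,

\[ V(\nu^*) \approx \frac{\eta^{2n} (e^2 - 1)^n}{4^n} . \]

For $\ell$ we have

\[ \ell = \prod_{i=1}^{n} P(Z_i \ge 1 - \eta) = \eta^n . \]

Finally,

\[ N \kappa^2 = \frac{V(\nu^*)}{\ell^2} - 1 \approx \frac{(e^2 - 1)^n}{4^n} - 1 . \tag{3.95} \]

Note that $N \kappa^2$ in (3.95) does not depend on $\eta$ and therefore neither does
it depend on $\gamma$. Consequently, the corresponding estimators have bounded
relative error in $\gamma$. Comparing (3.95) with $N \kappa^2 = \gamma^n e^n (2u)^{-n} = (-\ln \eta)^n (e/2)^n$
in (3.48), it readily follows that the former (ITLR) is much faster than the
latter (SLR), especially when $\gamma$ is large.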
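As a numerical sanity check of this example, the following Python sketch evaluates the exact CE solution and the exact value of N kappa^2 obtained from the second-moment formula, and compares them with the small-eta approximations. The function names `ce_nu`, `scv` and `scv_limit` are hypothetical helpers introduced here for illustration.

```python
import math

def ce_nu(eta):
    # Exact CE solution: nu* = eta / (eta + (1 - eta) ln(1 - eta))
    return eta / (eta + (1.0 - eta) * math.log1p(-eta))

def scv(eta, nu, n):
    # N * kappa^2 = V(nu) / ell^2 - 1, with V(nu) as in (3.94) and ell = eta^n
    per_component = (1.0 - math.exp((2.0 - nu) * math.log1p(-eta))) / (nu * (2.0 - nu))
    return (per_component / eta ** 2) ** n - 1.0

def scv_limit(n):
    # Small-eta limit of N * kappa^2, cf. (3.95)
    return ((math.e ** 2 - 1.0) / 4.0) ** n - 1.0

eta = 1e-4
nu_star = ce_nu(eta)
for n in (1, 3):
    print(n, scv(eta, nu_star, n), scv_limit(n))
```

The printed pairs should agree to a few decimal places, and `nu_star * eta` should be close to 2, in line with (3.93).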

Remark 3.27. Note that by minimizing (3.94) we can obtain the VM optimal
solution, say ${}^*\nu$. It can be shown that similar to (3.93) we have, for small $\eta$,

\[ {}^*\nu \approx \frac{c}{\eta} , \tag{3.96} \]

where $c = 1.59362\ldots$ is the solution to the equation

\[ 2 - (2 - c) e^{c} = 0 . \]



Hence, this is an example where the VM and CE optimal solutions do not 
coincide asymptotically. However, both solutions give bounded relative error. 
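The constant c in this remark is easy to reproduce numerically. The sketch below uses plain bisection on the bracket [1, 2], which is our choice (c = 0 is a trivial root and is excluded by the bracket). The limiting value (e^c - 1)/c^2 - 1 used for the VM case is our own small-eta simplification of (3.94) with nu = c/eta and n = 1, stated here as an assumption rather than a formula from the text.

```python
import math

def f(c):
    # Left-hand side of the equation 2 - (2 - c) e^c = 0
    return 2.0 - (2.0 - c) * math.exp(c)

def bisect(lo, hi, tol=1e-12):
    # Standard bisection; assumes f(lo) and f(hi) have opposite signs
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

c = bisect(1.0, 2.0)
vm_scv = (math.exp(c) - 1.0) / c ** 2 - 1.0    # limiting N*kappa^2 with nu = c/eta, n = 1
ce_scv = (math.exp(2.0) - 1.0) / 4.0 - 1.0     # limiting N*kappa^2 with nu = 2/eta, n = 1
print(c, vm_scv, ce_scv)
```

Under these assumptions the VM choice gives a somewhat smaller limiting SCV than the CE choice, while both stay bounded as eta tends to 0.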

The following proposition illustrates the usefulness of ITLR for estimating
small probabilities, for any distribution. In the results below the univariate
ITLR method is used with a $\mathsf{Beta}(\nu, 1)$ change of measure. It is important to
realize that this CoM may not be appropriate for similar problems concerning
multivariate random variables. Indeed the $\mathsf{Beta}(\nu, 1)$ CoM may give exponential
complexity, whereas a $\mathsf{Beta}(1, \nu)$ CoM could give polynomial complexity.

Proposition 3.28. Let $X = L(1 - Z)$, with $Z \sim \mathsf{U}(0,1)$, for some monotone
increasing function $L$ on $(0,1)$. Then, estimating $\ell = P(X \ge \gamma)$ via ITLR
using the $\{\mathsf{Beta}(\nu, 1), \nu > 0\}$ family of distributions gives an LR estimator
with bounded relative error.







Proof. The proof uses similar arguments to the ones used in Example 3.26.
First, we write $\ell = P(X \ge \gamma)$ as $\ell = P(Z \ge 1 - \eta)$, with $\eta = L^{-1}(\gamma)$. Hence,
if we estimate $\ell$ via the IS density

\[ h(z; \nu) = \nu z^{\nu - 1} , \tag{3.97} \]

then the optimal CE parameter is given, analogously to (3.93), by

\[ \nu^* = \frac{\eta}{\eta + (1 - \eta) \ln(1 - \eta)} \approx \frac{2}{\eta} , \]

as $\eta \to 0$. Moreover, the corresponding SCV satisfies

\[ N \kappa^2 \approx \frac{e^2 - 1}{4} - 1 \approx 0.597264 . \tag{3.98} \]

Note that this is independent of $\eta$ (and hence $\gamma$). Thus, the estimator has
bounded relative error.

Example 3.29 (Example 3.14 continued). Let $X \sim \mathsf{Weib}(a, u^{-1})$. That is, $X$
has cdf $F$ given by

\[ F(x) = 1 - e^{-(x/u)^a} , \quad x \ge 0 . \]

We wish to estimate $\ell = P(X \ge \gamma) = e^{-(\gamma/u)^a}$ for large $\gamma$. Using

\[ L(z) = u (-\ln z)^{1/a} , \quad z \in (0,1) , \]

we can write $\ell = P(Z \ge 1 - \eta)$, with $Z \sim \mathsf{U}(0,1)$ and $\eta = e^{-(\gamma/u)^a}$. Hence, by
Proposition 3.28 we can efficiently estimate $\ell$ via ITLR using the $\mathsf{Beta}(\nu, 1)$
density, yielding an estimator with bounded relative error given in (3.98).
Note that this is true for any shape parameter $a > 0$, including the heavy-tail
case $0 < a < 1$.
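A minimal simulation sketch of this example follows, using only the Python standard library. The function name `itlr_weibull_tail`, the sample size, and the choice to plug in the exact CE parameter (rather than estimating it via Algorithm 3.4.1) are our assumptions for illustration. Sampling from Beta(nu, 1) is done by inversion, Z = U^(1/nu), and all computations are carried out on the log scale because Z is extremely close to 1.

```python
import math
import random

def itlr_weibull_tail(gamma, a, u, n_samples, rng):
    """ITLR estimate of P(X >= gamma), X ~ Weib(a, 1/u), with a Beta(nu, 1) change of measure."""
    eta = math.exp(-(gamma / u) ** a)                      # exact ell = P(Z >= 1 - eta) = eta
    nu = eta / (eta + (1.0 - eta) * math.log1p(-eta))      # CE parameter, as in the proof above
    log_threshold = math.log1p(-eta)                       # ln(1 - eta)
    total = 0.0
    for _ in range(n_samples):
        log_z = math.log(1.0 - rng.random()) / nu          # ln Z for Z = U^(1/nu) ~ Beta(nu, 1)
        if log_z >= log_threshold:                         # event {Z >= 1 - eta}
            total += math.exp(-(nu - 1.0) * log_z) / nu    # LR = 1 / (nu * Z^(nu - 1))
    return total / n_samples, eta

estimate, exact = itlr_weibull_tail(1e6, 0.2, 1.0, 20000, random.Random(1))
print(estimate, exact)
```

With gamma = 10^6 and a = 0.2 the target probability is of order 10^-7, yet a modest sample should already give an estimate close to the exact value, which is what bounded relative error suggests.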



3.9 Equivalence Between SLR and TLR 

As we have seen the TLR method can be viewed as a universal device involving 
an additional transformation step as compared to SLR. Its main advantage 
is computational, since using TLR one can readily transform any underlying 
problem with any (arbitrary) input pdf, such as Weibull, to an equivalent one, 
such as the exponential, for which the updating of the parameter vector can 
be performed analytically. In this section we illustrate that seemingly different 
TLR and SLR methods can in fact be probabilistically equivalent. So, in the 
cases below we cannot get a more accurate estimate while switching from SLR 
to TLR. 

Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\mathsf{Weib}(a, u^{-1})$ distributed and consider the
estimation of a general rare-event probability $\ell = P(S(\mathbf{X}) \ge \gamma)$ for large $\gamma$ using
importance sampling. We consider three methods.







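(1) SLR with $\mathsf{Weib}(a, v^{-1})$ twisting, fixed $a$

The first method is a straightforward change of the Weibull scale parameter,
as in Example 3.14. In particular, we consider the change of measure

\[ X_i \sim \mathsf{Weib}(a, u^{-1}) \longrightarrow \mathsf{Weib}(a, v^{-1}) , \quad v > u . \]

Note that the problem is of the form discussed in Remark 3.15, but by symmetry
we know that the components of the reference vector must be equal.
This leads to slightly different updating formulas, namely:

\[ \hat{v}_t = \left( \frac{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \ge \hat{\gamma}_t\}} \, W(\mathbf{X}_k; u, \hat{v}_{t-1}) \, n^{-1} \sum_{i=1}^{n} X_{ki}^{a}}{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \ge \hat{\gamma}_t\}} \, W(\mathbf{X}_k; u, \hat{v}_{t-1})} \right)^{1/a} . \]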

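(2) ITLR with $\mathsf{Beta}(1, \nu)$ twisting

In the second method we estimate $\ell$ via the ITLR method. First, write
$X_i \sim \mathsf{Weib}(a, u^{-1})$ as

\[ X_i = u \left( -\ln(1 - Z_i) \right)^{1/a} , \]

with the $Z_i$ i.i.d. $\mathsf{U}(0,1) = \mathsf{Beta}(1,1)$. We now apply a change of measure on
the distribution of $Z_i$:

\[ Z_i \sim \mathsf{Beta}(1,1) \longrightarrow \mathsf{Beta}(1, \nu) , \quad 0 < \nu . \]

Define $\widetilde{S}(\mathbf{Z}) = S(\mathbf{X})$. The CE updating formula is, similar to (3.90),

\[ \hat{\nu}_t = - \frac{n \sum_{i=1}^{N} I_{\{\widetilde{S}(\mathbf{Z}_i) \ge \hat{\gamma}_t\}} \, \widetilde{W}(\mathbf{Z}_i; 1, \hat{\nu}_{t-1})}{\sum_{i=1}^{N} I_{\{\widetilde{S}(\mathbf{Z}_i) \ge \hat{\gamma}_t\}} \, \widetilde{W}(\mathbf{Z}_i; 1, \hat{\nu}_{t-1}) \sum_{j=1}^{n} \ln(1 - Z_{ij})} , \tag{3.100} \]

where $Z_{ij}$ is the $j$-th component of $\mathbf{Z}_i$.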

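It is interesting to compare the present ITLR method with the previous
Weibull change of measure. Since, under $\mathsf{Beta}(1, \nu)$, $Z_i$ can be written as
$Z_i = 1 - (1 - U_i)^{1/\nu}$, with $U_i \sim \mathsf{U}(0,1)$, we have

\[ X_i = u \, \nu^{-1/a} \left( -\ln(1 - U_i) \right)^{1/a} , \]

so that under the change of measure $Z_i \sim \mathsf{Beta}(1,1) \longrightarrow \mathsf{Beta}(1, \nu)$ we have
that $X_i \sim \mathsf{Weib}(a, u^{-1} \nu^{1/a})$. Let us compare the behavior of the SLR and
ITLR estimators for $\nu = (u/v)^a$. First of all, observe that

\[ W(\mathbf{X}; u, v) = \prod_{i=1}^{n} \frac{a u^{-1} (X_i/u)^{a-1} e^{-(X_i/u)^a}}{a v^{-1} (X_i/v)^{a-1} e^{-(X_i/v)^a}} = \prod_{i=1}^{n} \frac{1}{\nu (1 - Z_i)^{\nu - 1}} = \widetilde{W}(\mathbf{Z}; 1, \nu) . \]

This shows that

\[ \frac{1}{N} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \ge \gamma\}} \, W(\mathbf{X}_i; u, v) = \frac{1}{N} \sum_{i=1}^{N} I_{\{\widetilde{S}(\mathbf{Z}_i) \ge \gamma\}} \, \widetilde{W}(\mathbf{Z}_i; 1, \nu) . \]

In other words, the SLR estimator is identical to the ITLR estimator, provided
we take $\nu = (u/v)^a$. Note also that, in the same way, the CE updating
formulas and their deterministic counterparts are equivalent, in the sense that
$\hat{\nu}_t = (u/\hat{v}_t)^a$ and $\nu_t = (u/v_t)^a$.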



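(3) TLR with $\mathsf{Exp}(\lambda)$ twisting

Let us finally apply the TLR method with an "exponential change of measure."
We now write $X_i \sim \mathsf{Weib}(a, u^{-1})$ as

\[ X_i = u Z_i^{1/a} , \]

with the $Z_i$ i.i.d. $\mathsf{Exp}(1)$, and apply the change of measure

\[ Z_i \sim \mathsf{Exp}(1) \longrightarrow \mathsf{Exp}(\lambda) , \quad 0 < \lambda < 1 . \]

With $\widetilde{S}(\mathbf{Z}) = S(\mathbf{X})$ the CE updating formula is given by

\[ \hat{\lambda}_t = \frac{n \sum_{i=1}^{N} I_{\{\widetilde{S}(\mathbf{Z}_i) \ge \hat{\gamma}_t\}} \, \widetilde{W}(\mathbf{Z}_i; 1, \hat{\lambda}_{t-1})}{\sum_{i=1}^{N} I_{\{\widetilde{S}(\mathbf{Z}_i) \ge \hat{\gamma}_t\}} \, \widetilde{W}(\mathbf{Z}_i; 1, \hat{\lambda}_{t-1}) \sum_{j=1}^{n} Z_{ij}} , \tag{3.101} \]

where $Z_{ij}$ is the $j$-th component of $\mathbf{Z}_i$.

Since, under $\mathsf{Exp}(\lambda)$, $Z_i$ can be written as $Z_i = -\lambda^{-1} \ln(1 - U_i)$, with $U_i \sim \mathsf{U}(0,1)$, we have

\[ X_i = u \, \lambda^{-1/a} \left( -\ln(1 - U_i) \right)^{1/a} , \]

so that under this change of measure $X_i \sim \mathsf{Weib}(a, u^{-1} \lambda^{1/a})$. Repeating the
arguments of the ITLR method above, we find that this approach is equivalent
to the two methods above, provided that we take $\lambda = \nu = (u/v)^a$.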
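The equivalence of the three likelihood ratios can be checked pathwise for a single component. The Python sketch below draws a few Weibull variates under the tilted scale v, maps each to its uniform and exponential representations, and evaluates the three ratios. The helper names (`slr_lr`, `itlr_lr`, `tlr_exp_lr`) and the parameter values are ours and purely illustrative.

```python
import math
import random

def slr_lr(x, a, u, v):
    # Weib(a, 1/u) density divided by Weib(a, 1/v) density at x
    f_u = (a / u) * (x / u) ** (a - 1.0) * math.exp(-(x / u) ** a)
    f_v = (a / v) * (x / v) ** (a - 1.0) * math.exp(-(x / v) ** a)
    return f_u / f_v

def itlr_lr(z, nu):
    # U(0,1) density divided by Beta(1, nu) density at z
    return 1.0 / (nu * (1.0 - z) ** (nu - 1.0))

def tlr_exp_lr(y, lam):
    # Exp(1) density divided by Exp(lam) density (rate lam) at y
    return math.exp(-y) / (lam * math.exp(-lam * y))

a, u, v = 0.5, 1.0, 3.0
nu = (u / v) ** a                      # matching parameter nu = lambda = (u/v)^a
rng = random.Random(0)
triples = []
for _ in range(5):
    x = v * (-math.log(1.0 - rng.random())) ** (1.0 / a)   # X ~ Weib(a, 1/v)
    y = (x / u) ** a                   # exponential representation: X = u * Y^(1/a)
    z = 1.0 - math.exp(-y)             # uniform representation: X = u * (-ln(1 - Z))^(1/a)
    triples.append((slr_lr(x, a, u, v), itlr_lr(z, nu), tlr_exp_lr(y, nu)))
print(triples)
```

Each printed triple should contain three (numerically) equal values, which is the single-component version of the identity derived above.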

Remark 3.30 (Sum of independent random variables). The special case
where $S(\mathbf{X}) = X_1 + \cdots + X_n$, and the $X_i$ are i.i.d. with a subexponential
distribution was studied in both [14] and [91] via various methods, as explained
in the introduction. In particular for the heavy-tail Weibull case, Juneja and
Shahabuddin [91] (see their Theorem 3.2) proved that the change of measure

\[ X_i \sim \mathsf{Weib}(a, 1) \longrightarrow \mathsf{Weib}(a, \eta^{1/a}) \tag{3.102} \]

provides a logarithmically efficient estimator, in the sense of (1.14), when we
choose

\[ \eta = c \gamma^{-a} , \tag{3.103} \]

no matter how $c$ is chosen. On the other hand [14] proposed an importance
sampling distribution independent of $\gamma$ which is consistent with the fact that
$\eta \to 0$. In Appendix 3.13 we prove for the case $n = 2$ the somewhat stronger
result that the estimator is in fact polynomial and that the variance of the
estimator is minimized for $c = 2$; we conjecture that for general $n$ the variance
minimal (VM) parameter is

\[ {}^*\eta = n \gamma^{-a} . \]

In [15] it is shown that for large $\gamma$ the optimal CE parameter, $\eta^*$ say, is
indeed given by ${}^*\eta$ above. More precisely, the following argument is used:
Let $X_1, X_2, \ldots$ be i.i.d. random variables distributed according to a
subexponential distribution with respect to the Lebesgue measure. For large $\gamma$ the
asymptotic distribution of $X_1, \ldots, X_n$ given $X_1 + \cdots + X_n \ge \gamma$ is such that
with probability $1/n$ one of the random variables $X_1, \ldots, X_n$, say $X_i$, is
distributed according to the conditional distribution of $(X_i \mid X_i \ge \gamma)$, and the
others $\{X_j, j \ne i\}$ are distributed according to the original distribution. In
particular, consider $X_1, X_2, \ldots \sim \mathsf{Weib}(a, 1)$, with $0 < a < 1$. The optimal CE
parameter $\eta^*$ for the SLR change of measure (3.102) can be written as



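\[ \eta^* = 1 \Big/ \mathbb{E}\left[ n^{-1} \sum_{i=1}^{n} X_i^{a} \;\Big|\; X_1 + \cdots + X_n \ge \gamma \right] . \]

Using the fact that the Weibull distribution with $0 < a < 1$ is subexponential
it follows that for large $\gamma$

\[ \mathbb{E}\left[ n^{-1} \sum_{i=1}^{n} X_i^{a} \;\Big|\; X_1 + \cdots + X_n \ge \gamma \right] \approx n^{-1} (n - 1) \, \mathbb{E} X^{a} + n^{-1} \, \mathbb{E}\left[ X^{a} \mid X \ge \gamma \right] . \tag{3.104} \]

Since the conditional distribution of $X$ given $X \ge \gamma$ has pdf $a x^{a-1} e^{-(x^a - \gamma^a)}$,
$x \ge \gamma$, the expectation in (3.104) is given by

\[ \frac{n - 1}{n} + \frac{1}{n} (1 + \gamma^{a}) . \]

It follows that for large $\gamma$

\[ \eta^* \approx \frac{n}{n + \gamma^{a}} . \tag{3.105} \]

Similar results are obtained for the Pareto distribution.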
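For a quick feel for these asymptotics, the following Python sketch evaluates the approximate CE parameter n/(n + gamma^a) next to the conjectured VM parameter n * gamma^(-a). The function names are hypothetical, and the parameter values are chosen only for illustration.

```python
def ce_eta_asymptotic(n, gamma, a):
    # Large-gamma approximation of the optimal CE parameter, cf. (3.105)
    return n / (n + gamma ** a)

def vm_eta_conjectured(n, gamma, a):
    # Conjectured variance-minimal parameter n * gamma^(-a)
    return n * gamma ** (-a)

n, a = 5, 0.5
for gamma in (1e2, 1e6, 1e12):
    ce = ce_eta_asymptotic(n, gamma, a)
    vm = vm_eta_conjectured(n, gamma, a)
    print(gamma, ce, vm, ce / vm)
```

The ratio in the last column equals gamma^a / (n + gamma^a), so it tends to 1 as gamma grows, which is the sense in which the two parameters agree for large gamma.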







Remark 3.31 (Other changes of measure). Various other ideas for selecting a
good change of measure for the problem $\ell = P(X_1 + \cdots + X_n \ge \gamma)$ with i.i.d.
heavy-tail $X_i$ are considered in [15]. We mention the idea of estimating $\ell$ via

\[ P(X_1 + \cdots + X_n \ge \gamma, \; X_1 > X_2, \ldots, X_1 > X_n) = \ell / n , \tag{3.106} \]

where the left-hand side can be estimated via a change of measure in which
only the distribution of $X_1$ is changed. This leads to a significant variance
reduction.



3.10 Stationary Waiting Time of the GI/G/1 Queue 

In this section we give an application of the CE method to a dynamic model, as 
opposed to the static models considered so far. We will see that the formalism 
carries through without much alteration. 

Consider a stable GI/G/1 queue starting with customer $n = 1$ arriving at
an empty system. Let the interarrival time between customer $n$ and $n + 1$ be
denoted by $A_n$, $n = 1, 2, \ldots$, and let the service time of customer $n$ be
denoted by $B_n$, each sequence having its own common pdf. We assume that all
the service and interarrival times are independent. Let $S_n$ denote the actual
waiting time of the $n$-th customer; hence, by definition $S_1 = 0$. The stochastic
process $\{S_n, n \ge 1\}$ satisfies the celebrated Lindley equation (see for example [12])

\[ S_{n+1} = (S_n + X_n)^{+} , \]

with $X_n = B_n - A_n$, $n = 1, 2, \ldots$. For a stable system the random variables
$\{S_n\}$ converge in distribution to the steady-state waiting time, $S$ say.
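The Lindley recursion translates directly into code. The following Python sketch (the function name `lindley_waiting_times` and the toy input are ours) returns the waiting times of the successive customers, starting from an empty system.

```python
def lindley_waiting_times(interarrival, service):
    """Waiting times S_1, S_2, ... from the Lindley recursion S_{n+1} = max(0, S_n + B_n - A_n)."""
    waits = [0.0]                                  # S_1 = 0: the first customer finds an empty system
    for a_n, b_n in zip(interarrival, service):
        waits.append(max(0.0, waits[-1] + b_n - a_n))
    return waits

# Toy data: interarrival times A_1..A_3 and service times B_1..B_3
waits = lindley_waiting_times([1.0, 1.0, 1.0], [2.0, 0.5, 3.0])
print(waits)
```

For the toy data the second customer waits one time unit, the third waits half a unit, and the fourth waits 2.5 units.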

We are interested in estimating $\ell = P(S \ge \gamma)$ via importance sampling.
We consider two methods.

The regenerative method

Using the regenerative method, see for example [148], we can write

\[ \ell = \frac{\mathbb{E} \sum_{n=1}^{\sigma} I_{\{S_n \ge \gamma\}}}{\mathbb{E} \, \sigma} , \tag{3.107} \]

where $\sigma$ is the number of customers during the first busy period, that is

\[ \sigma = \inf\{n > 1 : S_n = 0\} - 1 . \]

Define $\tau$ as

\[ \tau = \inf\{n > 1 : S_n \ge \gamma\} . \]

In other words, $\tau$ is the first time that the process $\{S_n\}$ exceeds level $\gamma$, if at
all.





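Consider now the following switching change of measure [148]: for
$n = 1, \ldots, \min(\tau, \sigma)$ the interarrival times $A_n$ and the service times $B_n$ are
generated from IS densities instead of their original densities.

In other words, the IS distribution changes dynamically within the cycles. In
particular, we initially use the IS densities for the interarrival and
service times until the process $\{S_n\}$ exceeds level $\gamma$, after which we switch
back to the original densities; see Chapter 9 of [148]. By doing so the process
$\{S_n\}$ naturally returns to the regenerative state.

Under this change of measure the likelihood ratio $W_n$ of a sample $A_1, \ldots, A_n$,
$B_1, \ldots, B_n$ satisfies

\[ W_n = \begin{cases} W_{n-1} \cdot \dfrac{\text{original joint density of } (A_n, B_n)}{\text{IS joint density of } (A_n, B_n)} , & n \le \min(\tau, \sigma) \\[2ex] W_{\tau} , & n > \min(\tau, \sigma) . \end{cases} \tag{3.108} \]

From [148], we can write

\[ \ell = \frac{\mathbb{E} \sum_{n=1}^{\sigma} I_{\{S_n \ge \gamma\}}}{\mathbb{E} \, \sigma} = \frac{\widetilde{\mathbb{E}} \sum_{n=1}^{\sigma} I_{\{S_n \ge \gamma\}} W_n}{\mathbb{E} \, \sigma} , \tag{3.109} \]

where $\widetilde{\mathbb{E}}$ denotes expectation under the switching change of measure.
Note that the denominator of (3.109) can be easily estimated via CMC (no
change of measure here). The numerator of (3.109) (num) can be estimated
as

\[ \widehat{\mathrm{num}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{n=1}^{\sigma_i} I_{\{S_{in} \ge \gamma\}} W_{in} , \tag{3.110} \]

where $S_{in}$ and $W_{in}$ are the waiting time of the $n$-th customer and the
corresponding likelihood ratio, for iteration $i$.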

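Now consider the special case $A_1, A_2, \ldots \sim \mathsf{Weib}(a_1, u_1^{-1})$ and $B_1, B_2, \ldots \sim
\mathsf{Weib}(a_2, u_2^{-1})$. Using the TLR method, we may write

\[ X_n = u_2 \left( Z_n^{(2)} \right)^{1/a_2} - u_1 \left( Z_n^{(1)} \right)^{1/a_1} , \]

with $Z_n^{(k)} \sim \mathsf{Exp}(1)$, $k = 1, 2$, $n = 1, 2, \ldots$, so that

\[ S_{n+1} = \left( S_n + u_2 \left( Z_n^{(2)} \right)^{1/a_2} - u_1 \left( Z_n^{(1)} \right)^{1/a_1} \right)^{+} , \tag{3.111} \]

with $S_1 = 0$. Consider the following particular case of the switching change
of measure described above:

\[ Z_n^{(1)} \sim \mathsf{Exp}(1) \longrightarrow \mathsf{Exp}(v_1^{-1}) \quad \text{and} \quad Z_n^{(2)} \sim \mathsf{Exp}(1) \longrightarrow \mathsf{Exp}(v_2^{-1}) , \quad n \le \min(\tau, \sigma) . \]

Then (3.108) is given by

\[ W_n = \begin{cases} W_{n-1} \prod_{k=1}^{2} v_k \, e^{-(1 - v_k^{-1}) Z_n^{(k)}} , & n \le \min(\tau, \sigma) \\[1ex] W_{\tau} , & n > \min(\tau, \sigma) . \end{cases} \tag{3.112} \]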



Since the $Z_n^{(k)}$ are independent and have an exponential distribution we
can again apply the standard CE technique to determine/estimate the optimal
reference parameters $v_1$ and $v_2$ for the estimator (3.110) and achieve variance
reduction. In particular, if we define



n—1 



with Z = . . . , Z^\z^^), then, similar to Example 3.5, we have 

E„H(Z)iy,E:„zy' . 

Ev//(Z)W^r ’ ’ ’ 

for any reference vector $\mathbf{v} = (v_1, v_2)$. Note that in a multilevel CE procedure
the updating rule for the level $\gamma_t$ is not the usual quantile rule. Instead $\gamma_t$
should be chosen such that during each regeneration cycle at least a fraction
$\rho$ of the customers has a waiting time $\ge \gamma_t$.



Random Walk 

It is well known (see for example page 173 of [148]) that the steady-state
waiting time for this queueing system has the same distribution as the supremum
of the random walk $\{Y_n, n = 1, 2, \ldots\}$, where $Y_1 = 0$ and

\[ Y_{n+1} = Y_n + X_n , \quad n \ge 1 , \]

with $X_i = B_i - A_i$, $i = 1, 2, \ldots$, and the $A_i$ and $B_i$ the same as before. Thus
$\ell$ in (3.107) is the same as

\[ \ell = P\left( \sup_n Y_n \ge \gamma \right) . \tag{3.113} \]

Similar to (3.111) let us now (re-)define

\[ S_{n+1} = S_n + u_2 \left( Z_n^{(2)} \right)^{1/a_2} - u_1 \left( Z_n^{(1)} \right)^{1/a_1} , \tag{3.114} \]

with $Z_n^{(k)} \sim \mathsf{Exp}(1)$, $k = 1, 2$. Then, with $S = \sup_n S_n$, the estimation
of (3.113) (under the original pdfs $\{\mathsf{Weib}(a_1, u_1^{-1})\}$ and $\{\mathsf{Weib}(a_2, u_2^{-1})\}$) is
equivalent to the estimation of

\[ \ell = P(S \ge \gamma) . \]







Thus, alternatively to $I_{\{\sup_n Y_n \ge \gamma\}}$, which employs Weibull random variables,
we can simulate the random variable $I_{\{\sup_n S_n \ge \gamma\}}$ to estimate $\ell$, which employs
$\mathsf{Exp}(1)$ random variables. We can again apply the standard CE
technique to find the optimal IS reference parameter.

To proceed, define $\tau$ as the first time $\{S_n\}$ exceeds level $\gamma$ or falls below
some low level $-L$, that is

\[ \tau = \inf\{n > 0 : S_n > \gamma \text{ or } S_n < -L\} . \tag{3.115} \]

Consider the IS change of measure with $Z_n^{(k)} \sim \mathsf{Exp}(v_k^{-1})$. Typically, we look
for an IS change of measure under which the queue has a positive drift. In
that case $S_\tau \ge \gamma$ with high probability. For $-L$ small enough we may write
to a very close approximation

\[ \ell \approx P(S_\tau > \gamma) . \]



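It will be clear how we estimate the probability above: we run $N$ samples of
$S_1, \ldots, S_\tau$ and evaluate the estimator

\[ \hat{\ell} = \frac{1}{N} \sum_{i=1}^{N} I_{\{S_{\tau_i} \ge \gamma\}} W_{\tau_i} , \]

where

\[ W_{\tau} = \prod_{k=1}^{2} \prod_{n=1}^{\tau} v_{t-1,k} \, e^{-(1 - v_{t-1,k}^{-1}) Z_n^{(k)}} . \]

Applying the CE Algorithm 3.4.1 it is readily seen that the deterministic
updating rules for $\mathbf{v}_t = (v_{t,1}, v_{t,2})$ are

\[ v_{t,k} = \frac{\mathbb{E}_{\mathbf{v}_{t-1}} I_{\{S_\tau \ge \gamma_t\}} W_\tau \sum_{n=1}^{\tau} Z_n^{(k)}}{\mathbb{E}_{\mathbf{v}_{t-1}} I_{\{S_\tau \ge \gamma_t\}} W_\tau \, \tau} , \]

with $v_{0,k} = 1$, $k = 1, 2$. This leads to the simulated updating rules

\[ \hat{v}_{t,k} = \frac{\sum_{i=1}^{N} I_{\{S_{\tau_i} \ge \hat{\gamma}_t\}} W_{\tau_i} \sum_{n=1}^{\tau_i} Z_{in}^{(k)}}{\sum_{i=1}^{N} I_{\{S_{\tau_i} \ge \hat{\gamma}_t\}} W_{\tau_i} \, \tau_i} , \]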

where the simulation is run under $\hat{\mathbf{v}}_{t-1}$. Note that the updating rules for the
regenerative method and the random walk method are very similar. Indeed, it
is reasonable to expect that the optimal CE parameters for the two methods
should coincide for large $\gamma$; numerical results indicate that this is indeed the
case. Finally we remark that some care should be taken with the choice of the
low level $-L$. Typically, under the CE optimal parameter the system becomes
unstable and hence $-L$ can be safely set to $-\infty$, but for the first iteration the
system is still stable and hence $-L$ has to be chosen not too small in order to
save CPU time.





Remark 3.32. It is important to set $L$ in any simulation involving (3.115)
large enough to obtain a valid estimator for the steady-state waiting time
probabilities. The choice of $L$ is somewhat arbitrary. An alternative approach
is to take $L = 0$ and let $\ell$ correspond to the probability that the waiting time
process exceeds level $\gamma$ during a busy period. This is called the transient setting
in [148], Section 9.3.2. In our numerical results we will consider examples of
both cases.

Remark 3.33 (Exponential complexity). Although we show in [15] that both 
methods described above have in general exponential complexity, the numer- 
ical results in Section 3.12 show that in practice the CE method yields an 
excellent speed up (variance reduction), for probabilities down to, say, 10“ 



3.11 Big-Step CE Algorithm 

Consider Algorithm 3.4.1. The algorithm is designed to find a reference vector
$\hat{\mathbf{v}}_T$ such that

\[ P_{\hat{\mathbf{v}}_T}(S(\mathbf{X}) \ge \gamma) \ge \rho \]

for a given (large) $\gamma$ and $\rho$ not too small, for example, 0.01 or 0.1. However,
the number of levels required ($T$) may be quite large. Here we propose the
following simple alternative, called the big-step CE algorithm, which uses only
two levels.

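Algorithm 3.11.1 (Big-step CE Algorithm)

1. Generate a random sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ from the original distribution
$f(\cdot; \mathbf{u})$. Let $\hat{\gamma}_1$ be the $(1 - \rho)$-quantile of the performances, assuming that
this is less than $\gamma$. Otherwise put $\hat{\gamma}_1 = \gamma$.

2. Apply the following gradient procedure:

\[ \hat{\mathbf{v}}_1 = \mathbf{u} + \beta \, \nabla \widehat{D}(\mathbf{u}; \mathbf{u}, \hat{\gamma}_1) , \tag{3.116} \]

where $\nabla \widehat{D}(\mathbf{u}; \mathbf{u}, \hat{\gamma}_1)$ denotes the gradient at $\mathbf{u}$ of the function
$\mathbf{v} \mapsto \widehat{D}(\mathbf{v}; \mathbf{u}, \hat{\gamma}_1)$ defined by

\[ \widehat{D}(\mathbf{v}; \mathbf{u}, \hat{\gamma}_1) = \frac{1}{N} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \ge \hat{\gamma}_1\}} \ln f(\mathbf{X}_i; \mathbf{v}) , \tag{3.117} \]

using the sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ of Step 1, and $\beta$, called the "big-step parameter,"
is chosen (say by trial and error or by the bisection method) such
that

\[ \frac{1}{N} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \ge \gamma\}} \ge \rho \tag{3.118} \]

for some sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ from $f(\cdot; \hat{\mathbf{v}}_1)$.

3. Generate a sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ from the density $f(\cdot; \hat{\mathbf{v}}_1)$ and solve the
stochastic program

\[ \max_{\mathbf{v}} \frac{1}{N} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \ge \gamma\}} \, W(\mathbf{X}_i; \mathbf{u}, \hat{\mathbf{v}}_1) \ln f(\mathbf{X}_i; \mathbf{v}) . \]

Repeat this step for several additional iterations, until the solution has
converged. Denote the desired estimate by $\hat{\mathbf{v}}_2$.

4. Estimate the rare-event probability $\ell$ using the estimator (3.36), with $\mathbf{v}$
replaced by $\hat{\mathbf{v}}_2$.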

The rationale of the big-step method is that to estimate the optimal
reference parameter $\mathbf{v}^*$ using CE it is only necessary to obtain, one way or the
other, a reference vector $\hat{\mathbf{v}}_1$ such that

\[ P_{\hat{\mathbf{v}}_1}(S(\mathbf{X}) \ge \gamma) \ge \rho . \]

This reference parameter is then used to estimate the optimal reference
parameter for the estimation of the rare event $\{S(\mathbf{X}) \ge \gamma\}$. The algorithm sets
out to find such a $\hat{\mathbf{v}}_1$ satisfying (3.118) by searching in the direction of the
optimal reference parameter for estimating $P_{\mathbf{u}}(S(\mathbf{X}) \ge \hat{\gamma}_1)$. Numerical results
with the big-step Algorithm 3.11.1 are given in Tables 3.10 and 3.11 of Section
3.12.




3.12 Numerical Results 

This section presents simulation studies for the rare-event probability $\ell =
P(S(\mathbf{X}) \ge \gamma)$ for several static and queueing models with both light- and
heavy-tail distributions. The specific choices for $\gamma$ can be inferred from the
tables. We shall employ both the SLR (3.36) and TLR estimators.

Unless otherwise specified we set in all our experiments with Algorithm 
3.4.1 the rarity parameter g = 0.01, the sample size for Steps 2-4 of the 
algorithm N = 10^ and the final sample size A^i = 5 • 10^. Note that estimator 
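$\hat{\ell}$ is of the form $\hat{\ell} = N_1^{-1} \sum_{i=1}^{N_1} Z_i$; see (2.26). The SCV $\kappa^2$ of each $Z_i$ is
estimated as

\[ \hat{\kappa}^2 = \frac{S^2}{(\hat{\ell})^2} , \]

where $S^2$ is the sample variance of the $Z_i$.
The relative error of the estimator is thus estimated by $\mathrm{RE} = \hat{\kappa} / \sqrt{N_1}$.

Assuming asymptotic normality of the estimator, which we verified
numerically in each case, the confidence intervals (CI) now follow in a standard
way. For example, a 95% CI for $\ell$ is given by

\[ \hat{\ell} \pm 1.96 \, \hat{\ell} \, \mathrm{RE} . \]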
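These summary statistics are straightforward to compute from the individual terms of the estimator. The following Python sketch is illustrative only: the function name `relative_error_and_ci` is ours, and the use of the sample variance with divisor n - 1 is an assumption, since the text does not spell out the divisor.

```python
import math
import statistics

def relative_error_and_ci(samples, z=1.96):
    """Estimate, relative error and normal-approximation confidence interval.

    samples: the individual terms Z_i whose average is the estimator.
    """
    n = len(samples)
    estimate = statistics.fmean(samples)
    scv = statistics.variance(samples) / estimate ** 2   # estimated SCV (assumes n - 1 divisor)
    rel_err = math.sqrt(scv / n)                          # RE = kappa / sqrt(n)
    half_width = z * estimate * rel_err
    return estimate, rel_err, (estimate - half_width, estimate + half_width)

est, re, ci = relative_error_and_ci([1.0, 2.0, 3.0, 4.0])
print(est, re, ci)
```

In a rare-event setting the `samples` would be the products of indicator and likelihood ratio from the final simulation run.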

For a quite moderate probability like £ = 10“^, we typically compare the CE 
results with the corresponding CMC results. 







3.12.1 Sum of Weibull Random Variables 

Our first model concerns five i.i.d. $\mathsf{Weib}(a, u^{-1})$ random variables with $a = 5$
and $a = 0.2$, respectively. For both cases we selected $u = 1$. We wish to
estimate

\[ \ell = P(X_1 + \cdots + X_5 \ge \gamma) . \]

Tables 3.2 and 3.3 present, for the cases $a = 5$ and $a = 0.2$, respectively,
the performance of Algorithm 3.4.1 for the TLR method

\[ X_i = u Z_i^{1/a} , \quad Z_i \sim \mathsf{Exp}(1) \longrightarrow \mathsf{Exp}(v_i^{-1}) , \tag{3.119} \]

which is equivalent to the (one-parameter) SLR method

\[ X_i \sim \mathsf{Weib}(a, u^{-1}) \longrightarrow \mathsf{Weib}(a, u^{-1} v_i^{-1/a}) . \]



Table 3.2. The evolution of vt for estimating the optimal parameter v* with the 
TLR method (3.119), with a = 5. The estimated probability is £ = 1.6694 • 10“^, 
the relative error RE = 0.011763, and = 62.2 . 



 t    γ_t    v_1t   v_2t   v_3t   v_4t   v_5t
 0    -      1      1      1      1      1
 1    5.7    2.37   2.42   2.54   2.49   2.46
 2    6.7    5.52   4.91   4.84   4.97   5.20
 3    7.0    6.06   6.04   6.03   5.93   5.89
 4    7.0    5.99   5.96   6.02   6.00   5.99
 5    7.0    5.95   5.90   6.03   6.04   5.98
 6    7.0    5.95   5.98   6.04   5.93   5.98
 7    7.0    6.03   5.93   6.01   6.01   5.95
 8    7.0    6.00   6.08   6.02   5.90   5.95



Table 3.3. The evolution of vt for estimating the optimal parameter v* with the 
TLR method (3.119), with a = 0.2. The estimated probability is £ — 6.54 • 10~^, 
relative error RE = 0.0278, and = 386 . 



 t    γ_t        v_1t   v_2t   v_3t   v_4t   v_5t
 0    -          1      1      1      1      1
 1    9.7e+003   2.45   2.25   2.55   1.97   2.12
 2    6.4e+005   3.06   3.70   4.28   3.54   4.62
 3    1.0e+006   3.68   5.82   3.92   3.34   4.35
 4    1.0e+006   4.37   3.88   4.13   4.62   3.67
 5    1.0e+006   4.13   4.47   4.11   3.77   4.37
 6    1.0e+006   4.15   4.53   3.98   3.94   3.99
 7    1.0e+006   4.10   4.22   4.40   4.11   4.16
 8    1.0e+006   4.18   4.39   4.35   4.53   4.11







Note that in both cases Algorithm 3.4.1 reaches the desired level $\gamma$ after
three iterations, but we have continued iterating Steps 2-4 of Algorithm 3.4.1
in view of Remark 3.9. We see that the parameter vector stabilizes very
quickly. Note also that we could have taken the average of the reference
parameter at each iteration as a more accurate estimate for the optimal reference
parameter.
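To make the procedure behind these tables concrete, here is a compact Python sketch of a multilevel CE run with the exponential TLR change of measure for a sum of n Weibull variables with u = 1. It is not the code used for the tables: the function name `ce_tlr_sum`, the defaults (rho = 0.1, sample sizes, seed), and the stopping rule (stop as soon as the level reaches gamma) are our assumptions. The reference parameters v_j are the means of the exponential variables, so v_j = 1 corresponds to the original distribution.

```python
import math
import random

def ce_tlr_sum(gamma, a, n=5, rho=0.1, n_iter=2000, n_final=20000, seed=1, max_iter=50):
    """CE with Exp twisting for P(X_1 + ... + X_n >= gamma), X_j = Z_j^(1/a), Z_j ~ Exp(1)."""
    rng = random.Random(seed)
    v = [1.0] * n                                   # exponential means; start at the original Exp(1)

    def draw(count):
        out = []
        for _ in range(count):
            z = [rng.expovariate(1.0 / vj) for vj in v]          # Z_j with mean v_j
            perf = sum(zj ** (1.0 / a) for zj in z)              # S(X) for u = 1
            # log of the LR: product over j of v_j * exp(-(1 - 1/v_j) z_j)
            log_w = sum(math.log(vj) - (1.0 - 1.0 / vj) * zj for vj, zj in zip(v, z))
            out.append((perf, math.exp(log_w), z))
        return out

    level = 0.0
    for _ in range(max_iter):
        sample = draw(n_iter)
        perfs = sorted(p for p, _, _ in sample)
        level = min(gamma, perfs[int((1.0 - rho) * n_iter)])     # sample quantile, capped at gamma
        elite = [(w, z) for p, w, z in sample if p >= level]
        w_sum = sum(w for w, _ in elite)
        v = [sum(w * z[j] for w, z in elite) / w_sum for j in range(n)]   # weighted mean update
        if level >= gamma:
            break

    final = draw(n_final)
    estimate = sum(w for p, w, _ in final if p >= gamma) / n_final
    return estimate, v, level

est, v_hat, last_level = ce_tlr_sum(gamma=20.0, a=1.0)
print(est, v_hat, last_level)
```

The usage line takes a = 1, for which the sum has a gamma distribution and the exact probability is available in closed form, so the output can be checked. Calling `ce_tlr_sum(1e6, 0.2)` or `ce_tlr_sum(7.0, 5.0)` mimics the two settings of this subsection, though with different sample sizes and rarity parameter than in the tables.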

The asymptotic value for the optimal reference parameter $v$ in the heavy-tail
case is, see (3.105), given by

\[ \frac{1}{\eta^*} \approx \frac{1 + \gamma^{a}}{n} . \]

In particular for Table 3.3 we obtain a value of $(1 + 10^{1.2})/5 \approx 3.4$, which is
not too far from the observed value of around 4.2. Note that for the light-tail
case the above formula does not hold. Tables 3.4 and 3.5 present, for the same
cases $a = 5$ and $a = 0.2$ as above, the performance of Algorithm 3.4.1 for the
two-parameter SLR method

\[ X_i \sim \mathsf{Weib}(a, u^{-1}) \longrightarrow \mathsf{Weib}(b_i, v_i^{-1/b_i}) \tag{3.120} \]

of Remark 3.16.



Table 3.4. The evolution of bt and Vt for estimating the optimal parameters b* 
and V* with the two-parameter SLR method (3.120). The estimated probability is 
i = 1.6570 • 10“^, the relative error RE = 0.0041, and = 8.4 . 



 t    γ_t    b_1t   v_1t    b_2t   v_2t    b_3t   v_3t    b_4t   v_4t    b_5t   v_5t
 0    -      5      1       5      1       5      1       5      1       5      1
 1    5.19   7.6    2.8     7.1    2.7     7.0    2.6     7.4    3.0     7.3    2.7
 2    5.87   9.1    8.7     8.9    7.9     9.9    10.4    10.1   10.1    10.5   10.5
 3    6.41   11.2   28.5    11.8   37.4    12.1   39.8    11.2   34.3    12.2   34.5
 4    6.86   12.3   70.6    12.0   106.4   14.3   153.1   14.6   238.9   14.3   158.2
 5    7.00   14.4   231.5   14.1   250.1   12.9   109.0   11.2   92.4    13.8   179.2
 6    7.00   14.1   201.9   13.7   206.5   13.6   167.1   12.8   128.6   14.3   246.6
 7    7.00   14.0   211.9   13.9   209.5   14.2   206.0   13.3   167.5   14.0   205.5
 8    7.00   14.2   202.6   13.2   193.8   13.9   183.3   12.7   133.9   13.4   193.8
 9    7.00   14.0   194.4   13.3   195.3   14.2   201.3   13.0   146.1   13.7   195.6
 10   7.00   14.2   200.7   13.6   191.7   13.2   185.0   12.5   124.8   14.1   202.6



We see that both the one- and two-parameter methods give very accurate
results for both heavy- and light-tail Weibull distributions, and that the
TLR updating performs similarly to its two-parameter counterpart, although
repeated measurements indicate that for the cases above the RE is about two
times smaller for the two-parameter SLR method.







Table 3.5. The evolution of ht and vt for estimating the optimal parameters b* 
and V* with the two-parameter SLR method (3.120). The estimated probability is 
i = 6.5964 • 10“^, the relative error RE = 0.014723, and = 108.3 . 



 t    γ_t       b_1t   v_1t   b_2t   v_2t   b_3t   v_3t   b_4t   v_4t   b_5t   v_5t
 0    -         0.2    1      0.2    1      0.2    1      0.2    1      0.2    1
 1    971.28    0.17   1.55   0.18   1.69   0.18   1.63   0.17   1.48   0.18   1.52
 2    28750     0.15   1.76   0.15   2.09   0.15   1.89   0.15   1.75   0.14   1.40
 3    461370    0.12   1.86   0.13   1.84   0.12   1.43   0.12   1.53   0.13   2.38
 4    1000000   0.12   1.50   0.13   2.17   0.12   1.83   0.13   1.62   0.11   1.93
 5    1000000   0.12   1.59   0.12   1.66   0.12   1.92   0.11   1.66   0.12   1.99
 6    1000000   0.13   1.68   0.12   2.02   0.12   1.96   0.12   1.83   0.13   1.91
 7    1000000   0.12   1.72   0.13   1.97   0.12   1.87   0.12   1.77   0.12   1.87
 8    1000000   0.12   1.81   0.12   2.05   0.12   1.90   0.13   1.94   0.12   1.67
 9    1000000   0.12   1.95   0.12   1.70   0.13   1.88   0.12   1.66   0.12   1.88



3.12.2 Sum of Pareto Random Variables 

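Here we repeat the experiments of Tables 3.2 and 3.3 for the Pareto case.
Specifically, we now let the $X_i$ have a Pareto pdf $f(x) = a u^{-1} (1 + x u^{-1})^{-(a+1)}$, $x \ge 0$,
and consider the TLR change of measure

\[ X_i = u \left( e^{Z_i / a} - 1 \right) , \quad Z_i \sim \mathsf{Exp}(1) \longrightarrow \mathsf{Exp}(v_i^{-1}) . \tag{3.121} \]

This is equivalent to the SLR change of measure

\[ X_i \sim \mathsf{Pareto}(a, u^{-1}) \longrightarrow \mathsf{Pareto}(a v_i^{-1}, u^{-1}) . \tag{3.122} \]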

Tables 3.6 and 3.7 present the performance of the TLR method for a = 5 and 
a = 0.2, respectively. For both cases we selected u = 1 and took V = 2 • 10^ 
and Ni = 10^ 

Table 3.6. The evolution of vt for estimating the optimal parameter v* with the 
TLR method for a = 5. The estimated probability is ^ = 5.22 • 10”^, the relative 
error RE = 0.0238, and = 570.98 . 



 t    γ_t     v_1t   v_2t   v_3t   v_4t   v_5t
 0    -       1      1      1      1      1
 1    2.14    1.90   1.88   1.88   1.93   1.93
 2    5.56    2.95   2.94   2.93   2.93   2.96
 3    13.06   3.67   3.62   3.46   3.68   3.87
 4    22.41   4.50   3.99   4.19   4.30   3.89
 5    25.00   3.61   5.35   3.92   4.52   3.88
 6    25.00   4.02   4.24   4.40   4.36   4.45
 7    25.00   4.44   4.30   4.26   4.09   4.27
 8    25.00   4.38   4.18   4.11   4.09   4.63
 9    25.00   4.27   4.07   4.29   4.47   4.25
 10   25.00   4.38   4.33   4.41   4.28   3.92







Table 3.7. The evolution of vt for estimating the optimal parameter v* with the 
TLR method for a = 0.2. The estimated probability is i = 4.86 • 10“^, the relative 
error RE = 0.0267, and ^ = 716.74 . 



 t    γ_t        v_1t   v_2t   v_3t   v_4t   v_5t
 0    -          1      1      1      1      1
 1    2.6e+008   1.74   1.75   1.75   1.74   1.74
 2    4.6e+014   2.32   2.34   2.33   2.38   2.34
 3    4.9e+019   2.78   2.86   2.72   2.89   2.82
 4    4.9e+023   3.22   3.24   2.99   3.26   3.23
 5    6.7e+026   3.56   3.47   3.26   3.48   3.58
 6    1.3e+029   3.80   4.02   3.29   3.74   3.53
 7    1.0e+031   3.60   4.09   3.74   4.11   3.78
 8    4.5e+032   3.74   4.05   3.91   3.39   4.67
 9    2.8e+033   4.00   4.72   3.78   3.81   4.48
 10   1e+035     4.48   3.97   4.12   4.57   3.86
 11   1e+035     4.16   4.35   4.57   3.99   4.11
 12   1e+035     4.37   4.49   4.16   4.13   4.00
 13   1e+035     4.14   4.00   4.25   4.11   4.54
 14   1e+035     4.12   4.24   4.44   4.16   4.24
 15   1e+035     4.30   4.16   4.53   4.18   4.30



Although in this case the TLR change of measure (3.121) does not seem as
“natural” as the SLR one, where a or u is changed, a good variance reduction
is again obtained. An advantage of (3.121) is that only one line of the code
for the Weibull case needed to be changed. Besides, as we saw, the SLR method
where only a is allowed to change is identical to the TLR method described
above.



3.12.3 Stochastic Shortest Path 

Our second model concerns the stochastic shortest path problem of Section 2.2.1.
That is, we have the weighted graph given in Figure 2.1, with random weights
$X_1, \ldots, X_5$. Suppose the random variables $X_1, \ldots, X_5$ are
independent and have a $\mathsf{Weib}(a_i, u_i^{-1})$ distribution,
$i = 1, \ldots, 5$. Let $S(\mathbf{X})$ be the length of the shortest path from
node A to node B. Note that there are four possible paths. We wish to estimate
from simulation the probability $\ell = \mathbb{P}(S(\mathbf{X}) \geq \gamma)$
that the length of the shortest path $S(\mathbf{X})$ will exceed some fixed
$\gamma$. We consider the light- and heavy-tail cases $a_i = 5$ and $a_i = 0.2$,
$i = 1, \ldots, 5$. In both cases $\mathbf{u} = (0.25, 0.4, 0.1, 0.3, 0.2)$.
Tables 3.8 and 3.9 present the performance of Algorithm 3.4.1 with the TLR
method (3.119), for the cases $a = 5$ and $a = 0.2$, respectively. In both
cases $\gamma_t$ reaches its final level (0.8 and 100000, respectively) within
seven iterations.
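To make the setup concrete, the following is a minimal crude Monte Carlo sketch of the shortest path model, not the TLR procedure used for the tables. It rests on two assumptions: that the four paths of the bridge network in Figure 2.1 are $X_1+X_4$, $X_1+X_3+X_5$, $X_2+X_5$ and $X_2+X_3+X_4$, and that a $\mathsf{Weib}(a, u^{-1})$ variable can be generated as $u Z^{1/a}$ with $Z \sim \mathsf{Exp}(1)$. The function names are illustrative only.

```python
import math
import random

# Scale parameters u_1..u_5 quoted in the text.
U = (0.25, 0.4, 0.1, 0.3, 0.2)


def shortest_path_length(x):
    """Shortest A-to-B path length; the path structure is an assumption."""
    x1, x2, x3, x4, x5 = x
    return min(x1 + x4, x1 + x3 + x5, x2 + x5, x2 + x3 + x4)


def sample_weights(rng, a, u=U):
    """Draw the five link weights as u_i * Z**(1/a) with Z ~ Exp(1)."""
    return [ui * rng.expovariate(1.0) ** (1.0 / a) for ui in u]


def cmc_estimate(gamma, a, n, seed=0):
    """Crude Monte Carlo estimate of P(S(X) >= gamma) and its relative error."""
    rng = random.Random(seed)
    hits = sum(
        shortest_path_length(sample_weights(rng, a)) >= gamma for _ in range(n)
    )
    ell = hits / n
    # Relative error of an indicator average: sqrt((1 - ell) / (n * ell)).
    re = math.sqrt((1.0 - ell) / (n * ell)) if hits else float("inf")
    return ell, re
```

For the levels used in Tables 3.8 and 3.9 such a crude estimator would almost never register a hit, which is why importance sampling is needed there.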







Table 3.8. The evolution of vt for estimating the optimal parameter v* with the
TLR method and a = 5. The estimated probability is ^ = 1.20 • 10“^°, the relative 
error 0.044. 



 t    γ_t     v_{1t}   v_{2t}  v_{3t}  v_{4t}  v_{5t}
 0    -       1        1       1       1       1
 1    0.568   2.491    1.530   1.267   1.748   1.931
 2    0.650   4.257    2.152   1.543   2.431   2.977
 3    0.706   6.052    2.705   1.896   3.294   4.153
 4    0.752   8.125    3.476   2.260   4.128   5.360
 5    0.792   10.356   4.074   2.630   4.994   6.687
 6    0.800   10.293   4.126   2.850   5.519   7.460
 7    0.800   10.712   4.265   2.520   5.090   7.109
 8    0.800   10.550   4.125   2.565   5.310   7.383
 9    0.800   10.897   4.377   2.577   5.277   7.096


Table 3.9. The evolution of vt for estimating the optimal parameter v* with the 
TLR method and a = 0.2. The estimated probability is £ = 1.09 • 10“^^, the relative 
error 0.026. 



 t    γ_t          v_{1t}   v_{2t}   v_{3t}  v_{4t}  v_{5t}
 0    -            1        1        1       1       1
 1    6.760        2.005    1.906    1.166   1.857   1.912
 2    159.419      3.067    2.911    1.038   2.499   2.619
 3    1070.002     4.226    3.940    1.052   3.029   3.211
 4    4173.601     5.320    4.930    0.854   3.598   3.901
 5    11663.017    6.877    6.333    1.118   3.730   3.867
 6    34307.081    9.237    8.434    1.078   3.461   3.548
 7    100000.000   7.030    6.623    0.842   7.762   7.658
 8    100000.000   11.309   10.660   1.043   3.227   3.474
 9    100000.000   14.038   13.035   0.981   1.126   1.189
10    100000.000   14.261   13.008   0.979   1.066   1.035



Tables 3.10 and 3.11 present the performance of the big-step Algorithm
3.11.1 for the same data as in Tables 3.8 and 3.9, respectively. It is readily seen
that the big-step algorithm substantially speeds up the process: the level
$\gamma_t$ reaches its final value after the first iteration.







Table 3.10. The evolution of vt for estimating the optimal parameter v* with the 
TLR method using the big-step Algorithm 3.11.1 and a = 5. Estimated probability 
^ = 1.24 • 10“^°, relative error RE = 0.0438, Matlab CPU 15.2 seconds. 



 t    γ_t   v_{1t}  v_{2t}  v_{3t}  v_{4t}  v_{5t}
 0    -     1       1       1       1       1
 1    0.8   7.12    4.99    4.36    5.43    5.79
 2    0.8   10.8    4.24    2.70    5.20    7.05
 3    0.8   11.1    4.35    2.57    5.13    7.12
 4    0.8   11.1    4.37    2.54    5.18    6.86
 5    0.8   11.0    4.42    2.86    5.18    6.71
 6    0.8   10.7    4.24    2.89    5.30    7.00



Table 3.11. The evolution of vt for estimating the optimal parameter v* with the 
TLR method using the big-step Algorithm 3.11.1 and a = 0.2. Estimated probability 
£ = 1.082 • 10“^^, relative error RE = 0.026, Matlab CPU 14.8 seconds. 



 t    γ_t     v_{1t}  v_{2t}  v_{3t}  v_{4t}  v_{5t}
 0    -       1       1       1       1       1
 1    10000   7.02    7.83    5.31    7.76    7.72
 2    10000   10.9    9.91    1.22    4.93    5.31
 3    10000   12.8    12.1    1.24    2.67    2.91
 4    10000   14.2    13.1    1.08    1.01    1.04
 5    10000   14.1    13.0    0.975   0.995   1.10
 6    10000   14.2    13.0    0.995   0.991   1.01



3.12.4 GI/G/1 Queue 

Our third model is the GI/G/1 queue with interarrival time distribution
$\mathsf{Weib}(a_1, u_1^{-1})$ and service time distribution
$\mathsf{Weib}(a_2, u_2^{-1})$ discussed in Section 3.10. Note that the traffic
intensity of the queue (the expected service time divided by the expected
interarrival time) is thus given by

$$\frac{u_2\,\Gamma(1 + 1/a_2)}{u_1\,\Gamma(1 + 1/a_1)}\,.$$

We first consider the estimation of the probability that the stationary
waiting time in the queue exceeds some fixed level $\gamma$, using the random
walk method described in Section 3.10. In particular, with $A_i$ and $B_i$ the
interarrival and service times, we use the TLR change of measure

$$
\begin{aligned}
A_i &= u_1 \bigl(Z_i^{(1)}\bigr)^{1/a_1}, & Z_i^{(1)} &\sim \mathsf{Exp}(1) \to \mathsf{Exp}(v_1^{-1}), \\
B_i &= u_2 \bigl(Z_i^{(2)}\bigr)^{1/a_2}, & Z_i^{(2)} &\sim \mathsf{Exp}(1) \to \mathsf{Exp}(v_2^{-1}).
\end{aligned}
\qquad (3.123)
$$
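As a small illustration of the two formulas above, the sketch below computes the traffic intensity and draws one interarrival or service time under the TLR change of measure together with its likelihood ratio. It assumes that $\mathsf{Exp}(v^{-1})$ denotes the exponential distribution with rate $1/v$ (mean $v$); the function names are hypothetical and not taken from the book's code.

```python
import math
import random


def traffic_intensity(a1, u1, a2, u2):
    """Expected service time divided by expected interarrival time."""
    mean_interarrival = u1 * math.gamma(1.0 + 1.0 / a1)
    mean_service = u2 * math.gamma(1.0 + 1.0 / a2)
    return mean_service / mean_interarrival


def tlr_draw(rng, a, u, v):
    """Return (u * Z**(1/a), likelihood ratio) with Z exponential with mean v.

    The likelihood ratio is the Exp(1) density divided by the density of the
    sampling distribution (rate 1/v), evaluated at the drawn Z.
    """
    z = rng.expovariate(1.0 / v)  # expovariate takes the rate, not the mean
    lr = v * math.exp(-z * (1.0 - 1.0 / v))
    return u * z ** (1.0 / a), lr
```

With $v = 1$ the sampling distribution coincides with the original one and the likelihood ratio is identically 1, which gives a quick sanity check.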







Table 3.12 illustrates the evolution of Algorithm 3.4.1 for determining the
CE optimal parameters $v_1$ and $v_2$ to be used in the TLR estimator. In this
particular case the parameters are $a_1 = 0.5$, $u_1 = 1$, $a_2 = 0.5$ and
$u_2 = 0.5$, which gives a traffic intensity of 0.5. The level to be exceeded
is $\gamma = 80$. The sample size used in Steps 1-4 was $N = 50{,}000$. The
rarity parameter $\varrho$ was set to 0.01.

We repeated Steps 2-4 four more times after reaching $\gamma$ to show the
accuracy of the estimation of the true optimal CE parameter. (The
corresponding estimate and RE for this case are given in Table 3.13.)



Table 3.12. The evolution of Algorithm 3.4.1 using the TLR method for the
GI/G/1 queue with the following parameters: $a_1 = 0.5$, $u_1 = 1$, $a_2 = 0.5$,
$u_2 = 0.5$.

 t    γ_t    v_{1t}     v_{2t}
 0    -      1          1
 1    39.5   0.774073   1.39477
 2    80     0.796896   1.44949
 3    80     0.813729   1.42962
 4    80     0.810056   1.40465
 5    80     0.799487   1.43608
 6    80     0.801236   1.44118



It is interesting to note that after one iteration the system becomes
unstable, so that $\gamma_t$ in Step 2 of the CE algorithm reaches level
$\gamma$ in just two iterations. This is in accordance with the instability
property of the CE algorithm described and analyzed in [46]. As a consequence,
the choice of the rarity parameter does not matter very much.

Tables 3.13-3.16 summarize some performance characteristics of the TLR
estimation procedure as a function of $\gamma$, for various light- and heavy-tail cases.
In all cases we set N = 10^ and Ni=b - 10^. Also, the rarity parameter g was 
set to 0.1 (in fact any rarity parameter smaller than 1 would do) and the
level $-L$ was set sufficiently low, to $-100$.

In all tables we report the optimal CE parameters (the original ones are 
1), the estimate of the probability, the RE and the CPU time in seconds. 



Table 3.13. Simulation results for method 2 for the waiting time probabilities of
a GI/G/1 queue with heavy-tail interarrival and service time distributions, as a
function of $\gamma$. The traffic intensity is 0.5.

$a_1 = 0.5$, $u_1 = 1$, $a_2 = 0.5$, $u_2 = 0.5$

 γ      20              40              60             80             100            120
 v_1    0.78            0.79            0.80           0.80           0.80           0.81
 v_2    1.36            1.38            1.40           1.41           1.43           1.45
 ℓ      7.139·10^{-?}   1.152·10^{-?}   2.08·10^{-?}   4.25·10^{-?}   8.99·10^{-?}   2.08·10^{-?}
 RE     0.002           0.0036          0.0067         0.016          0.020          0.045
 sec    149             264             396            467            587            696







Table 3.14. Simulation results for method 2 for the waiting time probabilities of a
GI/G/1 queue with light-tail interarrival and service time distributions, as a function
of $\gamma$. The traffic intensity is 0.75.

$a_1 = 2$, $u_1 = 1$, $a_2 = 2$, $u_2 = 0.75$

 γ      3               6              9              12             15              18
 v_1    0.56            0.56           0.56           0.56           0.56            0.56
 v_2    1.57            1.58           1.58           1.58           1.58            1.59
 ℓ      1.031·10^{-2}   1.63·10^{-4}   2.60·10^{-6}   4.15·10^{-8}   6.63·10^{-10}   1.56·10^{-11}
 RE     0.0017          0.0027         0.0040         0.0053         0.013           0.016
 sec    101             154            210            274            338             398



Table 3.15. Simulation results for method 2 for the waiting time probabilities of
an M/M/1 queue, as a function of $\gamma$. The traffic intensity is 0.75.

$a_1 = 1$, $u_1 = 2$, $a_2 = 1$, $u_2 = 1.5$

 γ      20              40              60              80              100             120
 v_1    0.75            0.75            0.75            0.75            0.75            0.75
 v_2    1.33            1.33            1.33            1.33            1.33            1.33
 ℓ      2.676·10^{-2}   9.539·10^{-4}   3.404·10^{-5}   1.214·10^{-6}   4.333·10^{-8}   1.546·10^{-9}
 RE     0.00036         0.00039         0.00040         0.00040         0.00038         0.00053
 sec    160             429             509             558             691             828



Table 3.16. Simulation results for method 2 for the waiting time probabilities of
an M/G/1 queue, with heavy-tail service distribution, as a function of $\gamma$.
The traffic intensity is 0.5.

$a_1 = 1$, $u_1 = 1$, $a_2 = 0.5$, $u_2 = 0.25$

 γ      10             20             30             40             50             60
 v_1    0.81           0.83           0.84           0.85           0.85           0.89
 v_2    1.64           1.68           1.71           1.65           1.73           1.77
 ℓ      2.83·10^{-?}   3.55·10^{-?}   5.63·10^{-?}   1.05·10^{-?}   2.60·10^{-?}   7.07·10^{-?}
 RE     0.003          0.0067         0.012          0.017          0.047          0.093
 sec    108            190            224            306            335            407



The results seem to indicate that the RE increases (sub)linearly in $\gamma$,
but there is not sufficient evidence to conclude that the estimators are
polynomial, except in the M/M/1 case, where the RE remains constant. In the
latter case we have the well-known optimal (exponential) change of measure
where the service and interarrival rates are interchanged. What is clearer is
that for the light-tail case we can estimate much smaller probabilities than
for the heavy-tail case, for a given accuracy (RE) and simulation effort. It is
interesting to note that for the second experiment (with $a_1 = a_2 = 2$) quite
small probabilities can be efficiently estimated despite the fact that the TLR
estimator is not logarithmically efficient. Specifically, the only
logarithmically efficient estimator is obtained by an exponential change of
measure [150, 18]. The TLR change of measure for this case is obviously not an
exponential change of measure.
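For the M/M/1 case the simulated values can be checked against the standard closed-form tail of the stationary waiting time in the queue, $\varrho\, e^{-(\mu - \lambda)\gamma}$, where $\lambda$ and $\mu$ are the arrival and service rates and $\varrho = \lambda/\mu$. The sketch below evaluates that formula and also computes the TLR reference parameters that correspond to interchanging the two rates, under the assumption that the parameterization of (3.123) is used with $a_1 = a_2 = 1$. The function names are illustrative.

```python
import math


def mm1_waiting_tail(gamma, mean_interarrival, mean_service):
    """Exact P(stationary waiting time in queue > gamma) for a stable M/M/1 queue."""
    lam = 1.0 / mean_interarrival  # arrival rate
    mu = 1.0 / mean_service        # service rate
    rho = lam / mu
    if rho >= 1.0:
        raise ValueError("queue must be stable (traffic intensity < 1)")
    return rho * math.exp(-(mu - lam) * gamma)


def interchanged_reference_parameters(mean_interarrival, mean_service):
    """Reference parameters (v1, v2) that swap the two exponential rates.

    With A = u1 * Z1 and B = u2 * Z2, giving Z1 mean v1 = u2 / u1 and
    Z2 mean v2 = u1 / u2 turns the interarrival mean into u2 and the
    service mean into u1.
    """
    v1 = mean_service / mean_interarrival
    v2 = mean_interarrival / mean_service
    return v1, v2
```

With the Table 3.15 parameters (mean interarrival time 2, mean service time 1.5) the formula gives about 0.0268 at $\gamma = 20$, and the interchanged parameters are 0.75 and about 1.33, in line with the $v_1$ and $v_2$ rows of that table.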

Note also that for both light-tail cases the reference parameters seem to 
have “converged,” but not yet for the two heavy-tail cases. Also the estimates 
for the reference parameters seem more noisy in the heavy-tail case. In both 
the light- and heavy-tail case we observed that the estimates for the proba- 
bilities stabilized quite quickly (for moderate sample sizes). However, we also 
observed that accurate estimates for the variance of the estimator were much 
more difficult to obtain in the heavy-tail case than in the light-tail case. 

We repeated the experiments in Tables 3.13-3.16 for method 1, the
switching regenerative method, using $N_1 = 5 \cdot 10^{?}$ regeneration cycles and
using exactly the same CE parameters $v_1$ and $v_2$ as reported for method 2.
The results were very similar to those of method 2. Tables 3.17 and 3.18 give
the results for two of these experiments. We also ran the model with crude
Monte Carlo, that is, method 1 with $v_1 = v_2 = 1$, increasing the number
of cycles to $5 \cdot 10^{?}$ in order to obtain execution times of the same order as
the other methods. The CMC estimates were in close agreement with the
IS estimates, and the IS estimates consistently gave a significant variance
reduction, although less pronounced in the heavy-tail case.



Table 3.17. Simulation results for method 1 for the waiting time probabilities of
a GI/G/1 queue with heavy-tail interarrival and service time distributions, as a
function of $\gamma$. The traffic intensity is 0.5.

$a_1 = 0.5$, $u_1 = 1$, $a_2 = 0.5$, $u_2 = 0.5$

 γ      20             40              60             80             100            120
 ℓ      7.19·10^{-?}   1.161·10^{-2}   2.08·10^{-?}   4.38·10^{-?}   8.63·10^{-?}   2.01·10^{-?}
 RE     0.0087         0.011           0.017          0.029          0.034          0.071
 sec    59             80              109            135            170            200







Table 3.18. Simulation results for method 1 for the waiting time probabilities of a
GI/G/1 queue with light-tail interarrival and service time distributions, as a function
of $\gamma$. The traffic intensity is 0.75.

$a_1 = 2$, $u_1 = 1$, $a_2 = 2$, $u_2 = 0.75$

 γ      3               6              9              12             15              18
 ℓ      1.028·10^{-2}   1.63·10^{-4}   2.58·10^{-6}   4.12·10^{-8}   6.76·10^{-10}   1.05·10^{-11}
 RE     0.0064          0.0084         0.011          0.019          0.020           0.021
 sec    51              91             167            173            212             398



We also conducted various experiments in the transient setting (that is,
taking $L = 0$, see Remark 3.32, and using Pareto arrival and service times).
Tables 3.19 and 3.20 present two examples. Table 3.21 presents an example
using Pareto arrival and Weibull service time. For the Pareto case a TLR
change of measure similar to (3.121) was used. In all tables $\tau$ is as in
(3.115) with $L = 0$, and SCV stands for the squared coefficient of variation
of the random variable $IW$ in the TLR estimator.
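The RE and SCV columns of the following tables can be related by a short computation. The sketch below assumes that the SCV is the sample variance of the individual values $I_i W_i$ divided by their squared sample mean, and that the RE of the estimator based on $n$ such values is $\sqrt{\mathrm{SCV}/n}$. The function name is hypothetical.

```python
import math


def is_summary(samples):
    """Return (estimate, relative error, SCV) for a list of I*W values."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(samples) / n
    if mean == 0.0:
        raise ValueError("all samples are zero; the estimate is degenerate")
    # Unbiased sample variance of the individual I*W values.
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    scv = var / mean ** 2      # squared coefficient of variation of I*W
    re = math.sqrt(scv / n)    # relative error of the sample mean
    return mean, re, scv
```

Under these definitions, halving the RE for a fixed SCV requires four times as many samples.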



Table 3.19. Transient simulation results as function of $\gamma$ for the GI/G/1 queue
with the interarrival time distribution Pareto(0.5, 0.4) and service distribution
Pareto(0.5, 0.36). The traffic intensity is 0.9. For $\gamma = 40$ the probability was
checked by CMC estimation: $\hat\ell = 1.78 \cdot 10^{-2}$.

 γ      40          120         160         240         300         360
 N      5·10^{?}    5·10^{?}    5·10^{?}    5·10^{?}    5·10^{?}    5·10^{?}
 N_1    5·10^{?}    5·10^{?}    5·10^{?}    5·10^{?}    10^{?}      10^{?}
 ℓ      1.76e-002   1.49e-003   ?           5.32e-005   1.00e-005   1.81e-006
 RE     0.0068      0.013       0.016       0.023       0.021       0.026
 τ      94.26       655.96      1137.86     2075.22     3305.66     3703.51
 SCV    23.46       87.73       129.57      283.18      430.81      664.02



Table 3.20. Transient simulation results as function of $\gamma$ for the GI/G/1 queue with
the interarrival time distribution Pareto(3, 0.75) and service distribution Pareto(3, 1).
The traffic intensity is 0.75.

 γ      25          50          80          120         250         350
 N      10^{?}      10^{?}      10^{?}      10^{?}      10^{?}      10^{?}
 N_1    5·10^{?}    5·10^{?}    5·10^{?}    5·10^{?}    5·10^{?}    5·10^{?}
 ℓ      1.25e-003   8.72e-005   9.66e-006   2.51e-006   2.59e-007   7.54e-008
 RE     0.011       0.029       0.041       0.047       0.053       0.051
 τ      75.60       88.26       69.75       79.20       92.18       93.76
 SCV    61.81       427.33      841.23      1082.71     1387.39     1301.01







Table 3.21. Transient simulation results as function of $\gamma$ for the GI/G/1 queue with
the interarrival distribution Weib(2, 1) and service distribution Pareto(2.5, 1). The
traffic intensity is 0.75225.

 γ      20          50          130         160         300         400
 N      5·10^{?}    5·10^{?}    5·10^{?}    5·10^{?}    5·10^{?}    5·10^{?}
 N_1    2·10^{?}    3·10^{?}    3·10^{?}    3·10^{?}    3·10^{?}    3·10^{?}
 ℓ      2.37e-003   2.20e-004   1.05e-005   8.25e-006   1.33e-006   5.75e-007
 RE     0.016       0.030       0.033       0.029       0.028       0.027
 τ      18.38       17.21       16.03       14.44       16.08       14.92
 SCV    51.94       275.31      326.70      258.69      239.94      220.05



3.12.5 Two Non-Markovian Queues with Feedback 

As a final example, we consider the network depicted in Figure 3.3. It consists
of two queues in tandem, where customers departing from the second queue
either leave the network (with probability $p$), or go back to the first queue
(with probability $1 - p$). We are interested in estimating the transient
probability that the total number of customers in the network exceeds some
high level, 50 in this example, during one busy cycle. This model was also
considered in [45], using only light-tail distributions and applying IS with
exponential twisting.




Fig. 3.3. Two queues in tandem with feedback 



In the experiments reported below the interarrival time distribution is
a two-stage Erlang distribution, with exponential parameter $\lambda = 0.2$. The
service time distribution of the first queue is uniform on [0, 3.333]. In the
second queue the service time distribution is $\mathsf{Weib}(a, c)$. In Table 3.22
we consider the light-tail case with $a = 2$ and $c = 0.354491$, which gives a
mean service time of 2.5, while in Table 3.23 we consider the heavy-tail case
with $a = 0.8$ and $c = 0.453201$, which again gives a mean service time of 2.5.
We note that this is the same mean service time as in [45]. In the tables,
$\theta$ is the exponential twisting parameter for the uniform distribution.
The $\lambda$ column gives the evolution of the reference parameter for the
Erlang interarrivals, and similarly for $c$ and $p$.
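The quoted mean service times can be verified numerically. The sketch below assumes that $\mathsf{Weib}(a, c)$ has cdf $1 - e^{-(cx)^a}$, so that its mean is $\Gamma(1 + 1/a)/c$; this parameterization is an assumption that happens to reproduce the value 2.5 in both cases. The load calculation further assumes the standard traffic equations for a feedback loop and the initial feedback parameter $p = 0.5$ read from row $t = 0$ of the tables. All function names are illustrative.

```python
import math


def weibull_mean(a, c):
    """Mean of a Weibull variable with cdf 1 - exp(-(c*x)**a) (assumed form)."""
    return math.gamma(1.0 + 1.0 / a) / c


def station_loads(erlang_stages=2, erlang_rate=0.2, uniform_upper=3.333,
                  mean_service_2=2.5, p_leave=0.5):
    """Loads of the two queues when every customer makes 1/p_leave passes on average."""
    external_rate = erlang_rate / erlang_stages  # reciprocal of the Erlang mean
    effective_rate = external_rate / p_leave     # arrival rate including feedback
    rho1 = effective_rate * uniform_upper / 2.0  # uniform mean is upper / 2
    rho2 = effective_rate * mean_service_2
    return rho1, rho2
```

Under these assumptions the loads are roughly 0.33 for the first queue and 0.5 for the second, which agrees with the later remark that the second queue is the bottleneck.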







Table 3.22. Simulation results for the non-Markovian network for the light-tail case
with a = 2, c = 0.35449, 7 = 50, and sample sizes N = Ni = 10"^. The estimated 
probability is ^ = 1.62 10“^^, the relative error RE = 0.018. 



 t    γ_t   λ_t        θ_t         c_t        p_t
 0    3.0   0.200000   0.000000    0.354491   0.5
 1    50    0.342317   -0.023671   0.294095   0.177778
 2    50    0.363233   0.000000    0.315648   0.225282
 3    50    0.360159   0.000000    0.320599   0.234336
 4    50    0.360873   -0.003051   0.320986   0.234113
 5    50    0.358857   -0.003623   0.320894   0.235779
 6    50    0.360186   -0.000707   0.320591   0.234769
 7    50    0.359469   -0.003483   0.320718   0.234796



We see that the optimal CE parameters are estimated quite accurately for a
relatively small N. Since the second queue is the bottleneck, state-independent
tilting (changing the parameters irrespective of the state of the queue) seems
to work nicely, and the TLR method seems to deliver an accurate estimate
of a very small probability. No numerical results are available for validation;
therefore, we repeated the experiment a number of times. The fact that we
obtained similar estimates gives confidence in the result.



Table 3.23. Simulation results for the non-Markovian network for the heavy-tail
case with a = 0.8, c = 0.453201, 7 = 50, and sample sizes N = N\ = 10^. The 
estimated probability is ^ = 4.323 10“^®, the relative error RE = 0.0079. 



 t    γ_t   λ_t        θ_t         c_t        p_t
 0    3.0   0.200000   0.000000    0.453201   0.5
 1    50    0.300620   0.000000    0.263503   0.3019
 2    50    0.301135   0.000000    0.263982   0.3031
 3    50    0.301291   -0.000000   0.264346   0.3026
 4    50    0.300832   0.000000    0.263580   0.3031
 5    50    0.301350   -0.000000   0.263770   0.3029
 6    50    0.300620   0.000000    0.263503   0.3019
 7    50    0.301135   0.000000    0.263982   0.3031



For this heavy-tail case a similar picture emerges: the estimates for the
reference parameters are quite stable, and a small probability can be estimated
with reasonable accuracy. However, when we repeated this for a smaller $a$
($a = 0.5$), the results were not as satisfactory, indicating that a (much)
larger sample size is required.





3.13 Appendix: The Sum of Two Weibulls 



As noted in Remark 3.30, for the sum of $n$ heavy-tail Weibulls the change of
measure given by (3.102), for any constant $c$ in (3.103), gives an SLR estimator
which is logarithmically efficient. A proof of this is given in Theorem 3.2 of
[91]. In this appendix we prove that for the case $n = 2$ and for large $\gamma$
the best (that is, minimum variance) choice for $c$ is $c = n = 2$ and that
the estimator is not only logarithmically efficient, but in fact polynomial. We
conjecture that in general $c = n$. We show explicitly that the relative error
grows (for $n = 2$) as $\gamma^{a}$, and we conjecture that in general it grows
as $\gamma^{na/2}$. The proof below uses the TLR representation of the change of
measure, but it could as easily have been given via an SLR approach. Most of
the results hold for both the light-tail ($a > 1$) and the heavy-tail ($a < 1$)
case, except when the subexponentiality property is used for the heavy-tail
case. Without loss of generality we take $u = 1$.

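Thus the problem is as follows: Let $X_1, X_2$ be i.i.d. $\mathsf{Weib}(a, 1)$
distributed; estimate

$$\ell = \mathbb{P}(X_1 + X_2 > \gamma) = \mathbb{P}\bigl(Z_1^{1/a} + Z_2^{1/a} > \gamma\bigr),$$

with $Z_i \sim \mathsf{Exp}(1)$, independent. Consider the exponential change of
measure $Z_i \sim \mathsf{Exp}(1) \to \mathsf{Exp}(1 - \theta)$, where
$0 \leq \theta < 1$ is the exponential twisting parameter. Let
$\mathbb{E}_\theta$ denote the corresponding expectation operator. Thus
$\mathbb{E}_0$ corresponds to the original $\mathsf{Exp}(1)$ distribution. We have

$$\ell = \mathbb{E}_\theta\, I_{\{Z_1^{1/a} + Z_2^{1/a} > \gamma\}}\, W.$$

Here $W = W(\theta)$ is shorthand notation for the likelihood ratio

$$W(\theta) = \frac{e^{-\theta (Z_1 + Z_2)}}{(1 - \theta)^2},$$

where we have used the fact that the cumulant function for this exponential
family is given by $\zeta(\theta) = \ln(1/(1 - \theta)) = -\ln(1 - \theta)$.

There is no simple formula for $\ell$ as a function of $a$ and $\gamma$, but it
is not difficult to verify that

$$
\begin{aligned}
\ell &= \Bigl( 2 \iint_{A_1} + \iint_{A_2} + 2 \iint_{A_3} \Bigr) e^{-(z_1 + z_2)}\, dz_1\, dz_2 \\
     &= \exp\bigl(-\gamma^{a}\, 2^{1-a}\bigr) + 2 \int_0^{(\gamma/2)^a} \exp\bigl\{-(\gamma - x^{1/a})^a - x\bigr\}\, dx,
\end{aligned}
$$

where the regions $A_1$, $A_2$ and $A_3$ are given in Figure 3.4.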








Let us mention some known facts about $\ell$. First, for the heavy-tail case
$a < 1$ it is well known that the Weibull distribution is subexponential, which
means that the sum of $n$ i.i.d. Weibull random variables satisfies

$$\lim_{\gamma \to \infty} \frac{\mathbb{P}(X_1 + \cdots + X_n > \gamma)}{\mathbb{P}(X_1 > \gamma)} = n.$$

In particular, for our $n = 2$ case we have that

$$\lim_{\gamma \to \infty} \frac{\ell}{2 e^{-\gamma^a}} = 1.$$

For $a = 1$ it is not difficult to see that

$$\ell = e^{-\gamma} (\gamma + 1).$$

For $a > 1$ one can show that

$$\lim_{\gamma \to \infty} \frac{\ell(\gamma)}{e^{-2 (\gamma/2)^a}\, \gamma^{a/2}} = c(a),$$

for some constant $c(a)$, decreasing as $a$ increases. For example, for $a = 2$,
$c(a) = \sqrt{\pi/2}$ and for $a = 3$, $c(a) = \sqrt{3\pi}/4$.

Let us now turn to the complexity properties of the TLR estimator, as a
function of $\gamma$. These are, as always, determined by the second moment
(under $\theta$) of the random variable
$IW = I_{\{Z_1^{1/a} + Z_2^{1/a} > \gamma\}} W$. Using a simplified notation we have

$$
\begin{aligned}
\mathbb{E}_\theta (IW)^2 &= \mathbb{E}_\theta\, I W^2 = \mathbb{E}_0\, I W
  = \mathbb{E}_0\, I\, \frac{e^{-\theta (Z_1 + Z_2)}}{(1 - \theta)^2} \\
 &= \Bigl( 2 \iint_{A_1} + \iint_{A_2} + 2 \iint_{A_3} \Bigr)
    \frac{e^{-(1 + \theta)(z_1 + z_2)}}{(1 - \theta)^2}\, d\mathbf{z}.
\end{aligned}
\qquad (3.124)
$$

We wish to show that the SCV increases at most polynomially in $\gamma$, for a
certain choice of $\theta$. This is equivalent to showing that
$\mathbb{E}_\theta (IW)^2 / \ell^2$ increases at most polynomially in $\gamma$.
We do this by considering the contributions of the three integrals in (3.124)
individually.

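Define

$$D_i = \iint_{A_i} \frac{e^{-(1 + \theta)(z_1 + z_2)}}{(1 - \theta)^2}\, d\mathbf{z}, \quad i = 1, 2, 3.$$

The easiest of these is $D_2$; namely

$$D_2 = \left( \frac{e^{-(1 + \theta)(\gamma/2)^a}}{1 - \theta^2} \right)^2.$$

It follows that, for fixed $\theta$,

$$\lim_{\gamma \to \infty} \frac{D_2}{\ell^2}
  = \frac{1}{4 (1 - \theta^2)^2} \lim_{\gamma \to \infty} e^{-\gamma^a \left( (1 + \theta)\, 2^{1-a} - 2 \right)} = 0,$$

provided that $1 + \theta > 2^a$, or equivalently $1 - \theta < 2 - 2^a$.
Second, we have

$$D_1 \leq \int_0^\infty \int_{\gamma^a}^\infty \frac{e^{-(1 + \theta)(z_1 + z_2)}}{(1 - \theta)^2}\, d\mathbf{z}
  = \frac{e^{-\gamma^a (1 + \theta)}}{(1 - \theta^2)^2}.$$

The contribution of $D_1$ to the SCV is therefore approximately bounded by

$$\frac{D_1}{\ell^2} \approx \frac{1}{4}\, \frac{1}{(1 - \theta^2)^2}\, e^{-\gamma^a (1 + \theta - 2)}.$$

As a consequence, this contribution remains polynomial in $\gamma$ if we choose
$\theta = 1 - c \gamma^{-a}$, for any $c$. In that case

$$\frac{D_1}{\ell^2} \approx \frac{e^{c}\, \gamma^{4a}}{4 c^2 (c - 2 \gamma^a)^2}.$$

If we minimize this with respect to $c$, we obtain for fixed $\gamma$ the minimal
argument

$$c^* = \gamma^a - \sqrt{\gamma^{2a} + 4} + 2.$$

For large $\gamma$ we have thus $c^* \approx 2$. This suggests we take

$$\theta = 1 - 2 \gamma^{-a}.$$

Obviously, with this choice of $\theta$ the contribution of $D_2$ to the SCV
tends to 0, as $\gamma$ increases. It follows that

$$\frac{2 D_1 + D_2}{\ell^2} = \frac{e^2}{32}\, \gamma^{2a} + o(\gamma^{2a}).$$

It remains to show that the contribution of $D_3$ remains polynomial. We have

$$\frac{2 D_3}{\ell^2} \approx \frac{d_3}{2 (1 - \theta)^2 (1 + \theta)},$$

where

$$d_3 = \int_0^{(\gamma/2)^a} e^{-(1 + \theta) z}\, e^{2 \gamma^a}
  \left\{ e^{-(1 + \theta)(\gamma - z^{1/a})^a} - e^{-(1 + \theta) \gamma^a} \right\} dz.$$

For fixed $z$ and $\theta = 1 - 2 \gamma^{-a}$ write the integrand of $d_3$ as
$e^{-(1 + \theta) z}\, g(z, \gamma)$, where

$$g(z, \gamma) = e^{2 \gamma^a} \left\{ e^{-(2 - 2 \gamma^{-a})\, \gamma^a (1 - z^{1/a}/\gamma)^a}
  - e^{-(2 - 2 \gamma^{-a})\, \gamma^a} \right\}$$

decreases monotonically to 0 as $\gamma \to \infty$. By the monotone convergence
theorem, it follows that $d_3 \to 0$ as well, as $\gamma \to \infty$. Hence, we
have $D_3/\ell^2 = o(\gamma^{2a})$.

Concluding, for $a < 1$ we have proved that with the exponential twist
$\theta = 1 - 2 \gamma^{-a}$ the SCV of the TLR estimator increases
proportionally to $\gamma^{2a}$, as $\gamma \to \infty$, that is

$$\kappa^2(\gamma) = O(\gamma^{2a}) \quad \text{as } \gamma \to \infty. \qquad (3.125)$$

It is interesting to note that $\kappa^2$ decreases as $a$ decreases, that is,
as the tail of the Weibull pdf becomes heavier.

We conjecture that for arbitrary $n$ the optimal twisting parameter is
asymptotically $\theta^* \approx 1 - n \gamma^{-a}$ and that the SCV increases
proportionally to $\gamma^{na}$, as $\gamma \to \infty$.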
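The estimator analyzed in this appendix can be sketched in a few lines. The code below is only an illustration under the appendix's own setup ($u = 1$, two i.i.d. $\mathsf{Weib}(a, 1)$ variables generated as $Z^{1/a}$, twist $\theta = 1 - 2\gamma^{-a}$); the function name and the way the relative error is estimated from the sample are assumptions, not the book's code. For $a = 1$ the result can be compared with the exact value $e^{-\gamma}(\gamma + 1)$.

```python
import math
import random


def tlr_sum_of_two_weibulls(gamma, a, n, seed=0):
    """TLR estimate of P(Z1**(1/a) + Z2**(1/a) > gamma) with theta = 1 - 2*gamma**(-a)."""
    if gamma ** a <= 2.0:
        raise ValueError("need gamma**a > 2 so that 0 < theta < 1")
    theta = 1.0 - 2.0 * gamma ** (-a)
    rng = random.Random(seed)
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        # Sample Z1, Z2 from Exp(1 - theta); expovariate takes the rate.
        z1 = rng.expovariate(1.0 - theta)
        z2 = rng.expovariate(1.0 - theta)
        if z1 ** (1.0 / a) + z2 ** (1.0 / a) > gamma:
            # Likelihood ratio W(theta) = exp(-theta*(z1+z2)) / (1-theta)**2.
            w = math.exp(-theta * (z1 + z2)) / (1.0 - theta) ** 2
            total += w
            total_sq += w * w
    ell = total / n
    var = (total_sq / n - ell * ell) * n / (n - 1)  # sample variance of I*W
    re = math.sqrt(var / n) / ell if ell > 0.0 else float("inf")
    return ell, re
```

For example, with $\gamma = 10$ and $a = 1$ the twist is $\theta = 0.8$, and a run with a few tens of thousands of samples should land within a few percent of $11 e^{-10}$.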




3.14 Exercises 

1. Consider the estimation of the probability $\ell = \mathbb{P}(X \geq \gamma)$,
$0 < \gamma < 1$, where $X \sim \mathsf{Beta}(\nu_0, 1)$. That is, the pdf of $X$
is given by

$$f(x; \nu_0) = \nu_0\, x^{\nu_0 - 1}, \quad 0 \leq x < 1.$$

Suppose we wish to estimate $\ell$ via the SLR method. That is, through the
change of measure

$$X \sim \mathsf{Beta}(\nu_0, 1) \to \mathsf{Beta}(\nu, 1),$$

for some $\nu$, using the SLR estimator

$$\hat\ell = \frac{1}{N} \sum_{i=1}^{N} I_{\{X_i \geq \gamma\}}\, W(X_i; \nu_0, \nu). \qquad (3.126)$$

a) Using the cross-entropy program

$$\max_\nu D(\nu) = \max_\nu \mathbb{E}_{\nu_0}\, I_{\{X \geq \gamma\}} \ln f(X; \nu) \qquad (3.127)$$

find the optimal reference parameter $\nu^*$ analytically. Show that

$$\nu^* \approx \frac{2}{1 - \gamma}, \quad \text{as } \gamma \to 1. \qquad (3.128)$$



We could also estimate £ with TLR method, for example via the (TLR) 
change of measure 

X = ~ U(0, 1) ^ Beta(^, 1) , 



for some 

b) Verify from Example 3.26 that 

( 3129 ) 

c) Show that this TLR change of measure is equivalent to the SLR change 
of measure above — in the sense of Section 3.9 — if we take u = 

d) Show that (3.129) implies (3.128), as 7 ^ 1. 

e) Use Proposition 3.28 to show that for 7 1 the square coefficient of 

variation of £ in (3.126) 

2 Var(?) 1 e^-l 

using the reference parameter i/* = 2/(1 — 7). Find also the for the 
CMC estimator, (without change of measure) and compare the two. 

f) Write a computer program to estimate (for the SLR case) both z/* 
and £ using Algorithm 3.4.1. Set g = 10“^, uq = 2, N = 10^ (for 
estimating z/*) and A^i = 10^ (for estimating £), respectively. Select 
7 such that £ = 10“^^. Present a table similar to Table 2.1 of the 
tutorial chapter. 







2. Implement the CE method for the three different cases (SLR, ITLR and 
TLR) in Section 3.9, taking n = 1. Set g = 10“^, u = 1 and N = 10^. 
Select Ni large enough such that the estimated relative error is less than 
0.05. Select 7 (by trial and error) such that £ < 10“^^. 

3. Reproduce the results of Tables 3.8 and 3.9, by modifying the code of the 
first toy example in Appendix A.l. 

4. Reproduce the results of Tables 3.2 and 3.3, by changing only the perfor- 
mance function in the code of the above example. 

5. Reproduce the results of Tables 3.6 and 3.7, by altering in the code for 
Exercise 4 only the change of measure, from (3.119) to (3.121). 

6. Repeat Exercise 4 with the big-step method. 

7. Max-cut: Rare-event probability estimation. Consider the synthetic
max-cut problem of Section 2.5.1, with cost matrix $(c_{ij})$ of the form (2.38).
Let $\mathbf{p}_0 = (1/2, \ldots, 1/2)$ and $\mathbf{X} \sim \mathsf{Ber}(\mathbf{p}_0)$.
We wish to estimate the probability
$\ell = \mathbb{P}_{\mathbf{p}_0}(S(\mathbf{X}) \geq \gamma)$ for some large
$\gamma$. This can be done efficiently via (3.36) using
$\mathbf{X}_i \sim \mathsf{Ber}(\mathbf{p})$ instead of
$\mathbf{X}_i \sim \mathsf{Ber}(\mathbf{p}_0)$. To find the optimal $\mathbf{p}$
run the CE Algorithm 3.4.1 for 20 iterations, reporting the estimates
of $\gamma_t$ and $\mathbf{p}_t$ and the rare-event probability
$\ell_t = \mathbb{P}_{\mathbf{p}_0}(S(\mathbf{X}) \geq \gamma_t)$. Take
$n = 50$, $\varrho = 0.01$, $N = 10\,n$. Eliminate the stopping rule, since
$\gamma$ is not specified. Discuss your results.

8. Consider the graph of Figure 2.1, with random weights $X_1, \ldots, X_5$.
Suppose the weights are independent random variables, with $X_i \sim f(x; u_i)$.
Set $u_1 = u_2 = 0.2$ and generate the rest of the parameters $u_3, u_4, u_5$
randomly from the uniform distribution $\mathsf{U}(0.5, 1.0)$. Let
$S(\mathbf{X})$ be the total length of the shortest path from node A to node B.
Set $\varrho = 0.01$, $N = 1000$ and run Algorithm 3.4.1 for the following cases:

a) f{x;Vi) = x ^ 0. (exponential) 

b) f{x]Vi) = /x\, X = 0, 1, — (Poisson) 

c) f{x]Vi) = ce~^^'^\ 0 ^ X ^ 1. (truncated exponential) 

d) /(x; Vi) = vf {1 - X = 0, 1. (Bernoulli) 

When estimating i using the LR estimator £ choose the final sample 
size Ni large enough to ensure that the relative error k < 0.05. Also, 
make sure that 7 is chosen such that the rare-event probability £ satis- 
fies 10“^ < £ < 10“^. For the truncated exponential and the Bernoulli 
case run Algorithm 3.4.1 without fixing 7 in advance. By doing so, % in 
Algorithm 3.4.1 will attempt to continually increase 7. As a result, the 
parameters Vi and V 2 associated with the random variables Xi and X2, 
called the bottleneck parameters, will produce degenerated marginal dis- 
tributions. In particular, for i = 1, 2 we obtain Vi ^ 00 and Vi = 1 for the 
truncated exponential and the Bernoulli case, respectively. 







9. Verify (3.104) and (3.105). 

10* Consider the double exponential density 

a;€lR, 6 = {a,b) . 

The optimal solution of (3.65) follows from maximization of 

Y, liWi ln(6) - b Y liWi - a| , 

i i 

using similar abbreviations as in (3.60). For each fixed 5, the optimal a 
follows from minimization of 

YliWi\Xi-a\ , 
i 

which is very akin to determining the median of a discrete distribution. 
In particular, let . . . , X(jv) denote the order statistics of Xi, . . . , 
and for each X(^) let /(i)IV(i) be the corresponding product in {/iVFi, . . . , 
InWn}- Now define K as the first index such that 

. 1 



Verify that the CE optimal parameters a and b are given by 



and 



EjijWj 

ZiIiWi\Xi-a\ 



11* A GI/G/1 queue with exponential interarrival times is called an M/G/1 
queue. Consider an M/G/1 queue with Exp(A)-distributed interarrival 
times and Weib(a, iA“^)-distributed service times. Let g denote the traffic 
intensity: 



Q = 



EB 

1/X 



Xur{l + l/a) . 



The famous Pollaczek-Khinchine formula [12] states that the steady-state 
waiting time Z in an M/G/1 queue satisfies 



P(Z ^j) = (l-e)Y ^™(l - Fr(7)) , (3.130) 

m=0 



where is the cdf of the sum of m stationary excess service times, that 
is, is the m-fold convolution of the cdf Fb with density 







fsix) 



1-Fb{x) 

EB 



X ^ 0 , 



where Fb is the cdf of the service time. Let M be a “geometric” random 
variable with the following distribution 



P(M = m) = {l- q)q^, n = 0, 1, 2, . . . , (3.131) 

and let Xi,X2 , . . . be i.i.d. with cdf Fb and independent of M. Then we 
may write (3.130) as the random sum 

¥{Z ^ 7) = P( 5 m ^ 7) . 

with Sm ~ -^1 ■ * ' H” 

a) Describe how we can estimate F{Z ^ 7) via CMC. 

b) In order to apply CMC we need to draw from the cdf Fb . Verify that 

if F - Gam(l/a, 1), then - Fb. 

c) Describe how we can estimate P(Z ^ 7) via IS, by sampling from the 
excess lifetime distribution of Weib(a, for some v. 

d) Derive the likelihood ratio of (Xi, . . . , Xm) for this change of measure. 

e) Verify the CE updating rule for this change of measure: 

V J 

12. Implement the root finding Algorithm 3.7.1 and verify that for the toy 
example in Section 2.2.1 the root of the equation 

10“® = P(5'(X) ^ 7) 



is given by 7 2.40. 



13* Consider the random walk {5^, n = 1,2 ,...} with 5n = Xi + • • • + X^, 
such that each X^ takes values 1 and —1 with probabilities p and q = 1— p, 
respectively. Define the NEF {f(-;6),6 ^ 0} by 

f{x;6) a;e{-l,l}, 9^0, 

with h{l) = p, h{—l) = q and 

C(5) = InEe^^i = ln(pe^ + qe~^) . (3.132) 



We introduce a change of measure for each X^ via the NEF above. Note 
that for 0 = 0 we get our original distribution. 

a) Verify that the change of measure parameterized by 9 is equivalent to 
Xk taking values 1 and —1 with probabilities 



P = 



pe^ 

pe^ + qe~^ 



qe~ 



pe^ + qe~^ 



and q = 



(3.133) 







Define the probability measure such that under this measure the ran- 
dom walk {S'n, n = 0, 1 , . . .} starts in and the parameters p and q are 
changed to p and q in (3.133). The corresponding expectation operator is 
denoted by E^. For the original measure we use similarly P^ and E^. Let 
A be the event that the random walk reaches 7 — 2 before —i and let T be 
the first time that it reaches ^ — i or —i. Let ii = Fi{A) = Fi{ST = 7 — i). 

b) Prove that 



1 - {q/pY 
' 1 - {q/pP ’ 






(3.134) 



This is known as the gambler’s ruin [73]. 

c) Verify that the likelihood ratio of $(X_1, \ldots, X_n)$ with respect to $P_i$ and
$\tilde{P}_i$ is

$$W_n = e^{-\theta S_n + n \zeta(\theta)}.$$



We may estimate $\ell_i$ by simulating independent copies of the random variable
$Z = I_A W_T$ under $\tilde{P}_i$, and then taking the average. To find the optimal
CE parameter $\theta = \theta^*$, reparameterize the NEF via the mean $v = \zeta'(\theta)$.

d) Show that the optimal CE parameter $v^*$ satisfies

$$v^* = \frac{E_i I_A S_T}{E_i I_A T}.$$

e) Demonstrate that a zero-variance way to simulate $\ell_i$ is to use IS with
a state-dependent change of measure in which $p$ and $q$ are replaced
with

$$p_k \propto p\, \ell_{k+1} \quad \text{and} \quad q_k \propto q\, \ell_{k-1}, \quad k = 1, \ldots, \gamma - 1, \tag{3.135}$$

where $\propto$ is the symbol for proportionality.

More information on the CE method for random walks and queueing
networks can be found in [46].
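As a rough numerical companion to Exercise 13, the following Python sketch computes the tilted probabilities (3.133), evaluates the gambler's ruin formula (3.134), and estimates $\ell_i$ by IS using the likelihood ratio of part (c). It is only an illustration under stated assumptions: the walk is assumed to start at 0, and the function names, the values $p = 0.3$, $\gamma = 10$, $i = 3$, and the choice $\theta = \ln(q/p)$ (which swaps $p$ and $q$) are hypothetical and not part of the exercise.

```python
import math
import random

def tilted_probs(p, theta):
    """Tilted up/down probabilities (3.133) for tilting parameter theta."""
    q = 1.0 - p
    up, down = p * math.exp(theta), q * math.exp(-theta)
    return up / (up + down), down / (up + down)

def zeta(p, theta):
    """Log moment generating function (3.132)."""
    return math.log(p * math.exp(theta) + (1.0 - p) * math.exp(-theta))

def ruin_exact(p, i, gamma):
    """Gambler's ruin probability (3.134); requires p != 1/2."""
    r = (1.0 - p) / p
    return (1.0 - r ** i) / (1.0 - r ** gamma)

def ruin_is(p, i, gamma, theta, n_runs, seed=0):
    """IS estimate of l_i: simulate the tilted walk and weight by W_T."""
    rng = random.Random(seed)
    p_tilde, _ = tilted_probs(p, theta)
    z = zeta(p, theta)
    total = 0.0
    for _ in range(n_runs):
        s, t = 0, 0                        # assumed starting point 0
        while -i < s < gamma - i:          # run until gamma - i or -i is hit
            s += 1 if rng.random() < p_tilde else -1
            t += 1
        if s == gamma - i:                 # event A occurred
            total += math.exp(-theta * s + t * z)
    return total / n_runs

p, i, gamma = 0.3, 3, 10
theta = math.log((1.0 - p) / p)            # hypothetical choice: swaps p and q
exact = ruin_exact(p, i, gamma)
estimate = ruin_is(p, i, gamma, theta, n_runs=2000)
```

With this particular $\theta$ the likelihood ratio is essentially constant on $A$, so the estimator has a small relative error even for a modest number of runs; other values of $\theta$ would generally give noisier estimates.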



14* Network reliability. Consider a network of unreliable links, labeled
$\{1, \ldots, m\}$, modeling for example a communication network. With each
edge $i$ we associate its “state” $X_i$, such that $X_i = 1$ if link $i$ works and
$X_i = 0$ otherwise. We assume that $X_i \sim \mathrm{Ber}(p_i)$ and that all $X_i$ are
independent. Let $\mathbf{X} = (X_1, \ldots, X_m)$. We are interested in estimating the
reliability $r$ of the network, expressed as the probability that certain nodes,
say in some set $\mathcal{K}$, are connected. Thus,

$$r = E\, \varphi(\mathbf{X}) = \sum_{\mathbf{x}} \varphi(\mathbf{x})\, P(\mathbf{X} = \mathbf{x}), \tag{3.136}$$

where

$$\varphi(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathcal{K} \text{ is connected,} \\ 0 & \text{otherwise.} \end{cases}$$







This is the standard formulation of the reliability of unreliable systems;
see for example [62]. For highly reliable networks, that is, $p_i \approx 1$, or
equivalently, $q_i = 1 - p_i \approx 0$, it is more useful to analyze or estimate the
system unreliability

$$\bar{r} = 1 - r.$$

a) Consider the bridge network of Figure 2.1. Assume that the system
works if A and B are connected. Express the unreliability of the system
in terms of the link unreliabilities $\{q_i\}$.

b) Explain how we can simulate $\bar{r}$ via CMC, and show that this is very
inefficient when $\bar{r}$ is small.

A simple CE modification [87] can substantially speed up the estimation
of $\bar{r}$ when this quantity is small. The idea is to translate the original
problem (estimating $\bar{r}$), which involves independent Bernoulli random variables
$X_1, \ldots, X_m$, into an estimation problem involving independent exponential
random variables $Y_1, \ldots, Y_m$. Specifically, imagine that we have a
time-dependent system in which at time 0 all edges have failed and are under
repair, and let $Y_1, \ldots, Y_m$, with $Y_i \sim \mathrm{Exp}(u_i^{-1})$ and $u_i = 1/\lambda(i) = -1/\ln q_i$,
be the independent repair times of the edges. Note that, by definition,

$$P(Y_i \geq 1) = e^{-1/u_i} = q_i.$$

Now, for each $\mathbf{Y} = (Y_1, \ldots, Y_m)$ let $S(\mathbf{Y})$ be the random time at which
the system “comes up” (the nodes in $\mathcal{K}$ become connected). Then, we can
write

$$\bar{r} = P(S(\mathbf{Y}) \geq 1).$$

Hence, we have written the estimation of $\bar{r}$ in the standard rare-event
formulation of (3.1) and thus can directly apply the main CE
Algorithm 3.4.1.

c) Explain how the CE method can be employed using the above repre- 
sentation. 

Finally, we remark that a more sophisticated and significantly faster CE
method, based on graph-evolution models [53], is given in [87].
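To make the exponential repair-time representation of Exercise 14 concrete, here is a small Python sketch that computes the unreliability of a bridge network exactly by enumeration and also estimates it by CMC via $P(S(\mathbf{Y}) \geq 1)$. The link numbering is an assumption (links 1 and 2 leave A, links 4 and 5 enter B, link 3 is the cross link) and may differ from Figure 2.1; the function names and the value $q_i = 0.1$ are likewise hypothetical.

```python
import itertools
import math
import random

# Assumed bridge layout: minimal A-B paths as tuples of 0-based link indices.
PATHS = [(0, 3), (1, 4), (0, 2, 4), (1, 2, 3)]

def works(x):
    """phi(x): 1 if A and B are connected for link states x, else 0."""
    return int(any(all(x[e] for e in path) for path in PATHS))

def unreliability_exact(q):
    """Sum P(X = x) over all failed states x, by full enumeration."""
    total = 0.0
    for x in itertools.product((0, 1), repeat=len(q)):
        if not works(x):
            total += math.prod(1.0 - qe if xe else qe for xe, qe in zip(x, q))
    return total

def up_time(y):
    """S(y): first time at which some A-B path has all links repaired."""
    return min(max(y[e] for e in path) for path in PATHS)

def unreliability_cmc(q, n_runs, seed=0):
    """CMC estimate of P(S(Y) >= 1) with Y_i exponential, rate -ln q_i."""
    rng = random.Random(seed)
    rates = [-math.log(qe) for qe in q]
    hits = sum(up_time([rng.expovariate(lam) for lam in rates]) >= 1.0
               for _ in range(n_runs))
    return hits / n_runs

q = [0.1] * 5                       # hypothetical link unreliabilities
exact = unreliability_exact(q)
estimate = unreliability_cmc(q, n_runs=20000)
```

For much smaller $q_i$ the CMC estimate would need a far larger sample to see any failures at all, which is the inefficiency that part (b) asks about.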

15. Let Yi, Y2 be i.i.d. U(0, l)-distributed random variables. We wish to esti- 
mate 

i = ¥ ((-ln(ri))i/“ + (-ln(y2))^/“ lo lo ’ 

for some fixed a > 0 and 7 > 0, where A = {{yi,y 2 ) € [0,1]^ : 
(-ln(2/i))^/“ + (-ln(j/2))^^“ ^ 7}- Let 6=1- and define B 

to be the region [0, 1]^ \ [0, 6]^. 
a) Show that $A \subset B$.





b) By (a) we may estimate $\ell$ via IS using the change of measure whereby
$\mathbf{Y} = (Y_1, Y_2)$ is uniform on $B$, rather than $[0, 1]^2$. Verify that the
conditions for (3.5) are satisfied.

c) Show that the likelihood ratio for this change of measure is

$$W(\mathbf{y}) = 1 - b^2, \quad \mathbf{y} \in B.$$

d) Show that the SCV of $Z = I_{\{\mathbf{Y} \in A\}} W(\mathbf{Y})$ is given by



In other words, this IS estimator has exponential complexity. 




4 



Combinatorial Optimization via Cross-Entropy 



4.1 Introduction 

In this chapter we show how the CE method can be easily transformed into 
an efficient and versatile randomized algorithm for solving optimization prob- 
lems, in particular combinatorial optimization problems. 

Most combinatorial optimization problems, such as deterministic and 
stochastic (noisy) scheduling, the travelling salesman problem (TSP), the 
maximal cut (max-cut) problem, the longest path problem (LPP), optimal 
buffer allocation in a production line, and optimization of topologies and con- 
figurations of computer communication and traffic systems are NP-hard prob- 
lems. Well-known stochastic methods for combinatorial optimization problems 
are simulated annealing [1, 2, 38, 138], initiated by Metropolis [118] and later 
generalized in [80] and [100], tabu search [66], and genetic algorithms [69]. For 
very interesting landmark papers on the simulated annealing method see [138]. 
For some additional references on both deterministic and stochastic (noisy) 
combinatorial optimization see [2, 4, 6, 8, 38, 86, 88, 89, 93, 94, 95, 100, 101, 
102, 108, 109, 111, 123, 124, 127, 129, 130, 134, 135]. 

Recent randomized algorithms for combinatorial optimization include the
nested partitioning method (NP) of Shi and Olafsson [158, 159], the stochastic
comparison method of Gong [70], the method of Andradottir [10, 11], and
the ant colony optimization (ACO) meta-heuristic of Dorigo and colleagues
[49, 77].

The NP method is based on systematic partitioning of the feasible region
into smaller subregions until some of the subregions contain only one point.
The method then moves from one region to another based on information
obtained by random sampling. It has been shown that the NP algorithm
converges to an optimal solution with probability one.

The stochastic comparison method is similar to that of simulated anneal- 
ing, but does not require a neighborhood structure. 

The method of Andradottir can be viewed as a discrete stochastic
approximation method. The method compares two neighboring points in each
iteration and moves to the point that is found to be better. This method is
shown to converge almost surely to a local optimum. Andradottir [11] also
developed a similar method for finite noisy optimization which converges almost
surely to a global optimum.

The ant algorithms are based on ant colony behavior. It is known that
ant colonies are able to solve shortest-path problems in their natural
environment by relying on a rather simple biological mechanism: while walking,
ants deposit on the ground a chemical substance, called pheromone. Ants have
a tendency to follow these pheromone trails. Within a fixed period, shorter
paths between nest and food can be traversed more often than longer paths,
and so they obtain a higher amount of pheromone, which, in turn, tempts a
larger number of ants to choose them and thereby to reinforce them again.
This behavior of real ants has inspired many researchers to use ant system
models and algorithms in which a set of artificial ants cooperate in searching
for food by exchanging information via pheromone deposited either on the edges
or on the vertices of the graph. Gutjahr [75, 76, 77] was the first to prove the
convergence of the ant colony algorithm.

This chapter deals mainly with applications of CE to deterministic
combinatorial optimization problems (COPs), as opposed to noisy COPs, which
will be discussed in Chapter 6, with a particular emphasis on the max-cut
problem and the TSP. The CE method for combinatorial optimization was
motivated by the work [144], where an adaptive (variance minimization)
algorithm for estimating probabilities of rare events for stochastic networks
was presented. It was modified in [145] to solve continuous multi-extremal and
combinatorial optimization problems. Continuous multi-extremal optimization
is discussed in Chapter 5.

The main idea behind using CE for COPs is to first associate with each
COP a rare-event estimation problem, the so-called associated stochastic
problem (ASP), and then to tackle this ASP efficiently using an adaptive
algorithm similar to Algorithm 3.4.1. The principal outcome of this approach
is the construction of a random sequence of solutions which converges
probabilistically to the optimal or near-optimal solution of the COP.

As soon as the ASP is defined, the CE method involves the following two 
iterative phases: 

1. Generation of a sample of random data (trajectories, vectors, etc.) accord- 
ing to a specified random mechanism. 

2. Updating the parameters of the random mechanism, typically parameters 
of pdfs, on the basis of the data, to produce a “better” sample in the next 
iteration. 

The ASP for COPs can usually be formulated in terms of graphs. Depending
on the particular problem, we introduce the randomness in the ASP
either in (a) the nodes or (b) the edges of the graph. In the former case, we
speak of stochastic node networks (SNNs), in the latter case of stochastic edge
networks (SENs). Notice that a similar terminology is used in Wagner,
Lindenbaum, and Bruckstein [169] for the graph covering problem, called vertex
ant walk (VAW) and edge ant walk (EAW), respectively.

Examples of SNN problems are the max-cut problem, the buffer alloca- 
tion problem, and clustering problems. Examples of SEN problems are the 
travelling salesman problem, the quadratic assignment problem, different de- 
terministic and stochastic scheduling problems and the clique problem. 

In this and the subsequent chapters we demonstrate numerically the high 
accuracy of the CE method by applying it to different case studies from the 
World Wide Web. In particular, we show that the relative error of CE is typi- 
cally within the limits of 1-2% of the best known solution. For some instances 
CE even produced better solutions than the best ones known. For this reason 
we purposely avoided comparing CE with other alternative heuristics such 
as simulated annealing and genetic algorithms. Our goal here is to demon- 
strate its reliability, simplicity, and high speed of convergence. Moreover, we 
wish to establish some mathematical foundations for further research and 
promote it for new applications. We note that various issues dealing with con- 
vergence have been discussed in Section 3.6. For other convergence results see 
[85, 113, 145]. 

Remark 4.1 (Comparison). To compare CE with some other optimization
method, one could use the following criterion: Let 

= + ’ ( 4 . 1 ) 

7t 

where 7 ^^ is the optimal value of the objective function obtained by the 
method i, i = 1 , 2 , denotes the CPU time by the method i, i = 1, 2 and 
$w_1$ and $w_2$ are weight factors, with $w_1 + w_2 = 1$. If $S < 1$ then the first method
is considered more efficient, and vice versa if $S > 1$. We leave it up to the
decision maker to choose the parameter vector $(w_1, w_2)$.

The rest of this chapter is organized as follows: In Section 4.2 we for- 
mulate the optimization framework for the CE method and present its main 
algorithm. In Sections 4.3 and 4.4 we discuss separately the SNN- and SEN- 
type combinatorial optimization problems. In Sections 4.5 through 4.8 we 
apply the CE method to the max-cut and partition problem, the TSP, and 
the quadratic assignment problem (QAP), respectively. Numerical results for 
SNN and SEN are presented in Sections 4.9 and 4.10, respectively. Finally, in 
Section 4.11 we give various auxiliary results, including faster tour generating 
algorithms for the TSP and a detailed discussion of sufficient conditions for 
convergence of a CE-like algorithm in which the updating is based on a single 
best sample, rather than the usual $\varrho N$ best samples.





4.2 The Main CE Algorithm for Optimization 



The main idea of the CE method for optimization can be stated as follows.
Suppose we wish to maximize some performance function $S(\mathbf{x})$ over all states
$\mathbf{x}$ in some set $\mathcal{X}$. Let us denote the maximum by $\gamma^*$, thus

$$\gamma^* = \max_{\mathbf{x} \in \mathcal{X}} S(\mathbf{x}). \tag{4.2}$$

First, we randomize our deterministic problem by defining a family of
pdfs $\{f(\cdot; \mathbf{v}), \mathbf{v} \in \mathcal{V}\}$ on the set $\mathcal{X}$. Next, we associate with (4.2) the following
estimation problem

$$\ell(\gamma) = P_{\mathbf{u}}(S(\mathbf{X}) \geq \gamma) = E_{\mathbf{u}} I_{\{S(\mathbf{X}) \geq \gamma\}}, \tag{4.3}$$

the so-called associated stochastic problem (ASP). Here, $\mathbf{X}$ is a random vector
with pdf $f(\cdot; \mathbf{u})$, for some $\mathbf{u} \in \mathcal{V}$ (for example, $\mathbf{X}$ could be a Bernoulli random
vector) and $\gamma$ is a known or unknown parameter. Note that there are in fact
two possible estimation problems associated with (4.3). For a given $\gamma$ we can
estimate $\ell$, or alternatively, for a given $\ell$ we can estimate $\gamma$, the root of (4.3).
Let us consider the problem of estimating $\ell$ for a certain $\gamma$ close to $\gamma^*$. Then,
typically $\{S(\mathbf{X}) \geq \gamma\}$ is a rare event, and estimation of $\ell$ is a nontrivial
problem. Fortunately, we can use the CE formulas (3.37)-(3.41) to solve this
efficiently by making adaptive changes to the probability density function
according to the Kullback-Leibler cross-entropy, thus creating a sequence
$f(\cdot; \mathbf{u}), f(\cdot; \mathbf{v}_1), f(\cdot; \mathbf{v}_2), \ldots$ of pdfs that are “steered” in the direction of the
theoretically optimal density. The following proposition shows that the CE
optimal density is often the atomic density at $\mathbf{x}^*$.

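Proposition 4.2. Let $\gamma^*$ be the maximum value of a real-valued function $S$
on a finite set $\mathcal{X}$. Suppose that the corresponding maximizer is unique, say $\mathbf{x}^*$,
and that the class of densities $\{f(\cdot; \mathbf{v})\}$ to be used in the CE program (3.23)
contains the atomic or degenerate density with mass at $\mathbf{x}^*$:

$$\delta_{\mathbf{x}^*}(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{x} = \mathbf{x}^*, \\ 0 & \text{otherwise.} \end{cases} \tag{4.4}$$

Then, the solutions of both VM and CE programs (3.15) and (3.23) for
estimating $P(S(\mathbf{X}) \geq \gamma^*)$ coincide and correspond to $\delta_{\mathbf{x}^*}$.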

Proof. Let ${}_*\mathbf{v}$ be such that

$$f(\cdot; {}_*\mathbf{v}) = \delta_{\mathbf{x}^*}(\cdot). \tag{4.5}$$

Since the variance of the LR estimator $\hat{\ell}$ given in (3.6), under ${}_*\mathbf{v}$, is zero,
it is obvious that ${}_*\mathbf{v}$ is the optimal reference parameter for (3.15), with
$H(\mathbf{x}) = I_{\{S(\mathbf{x}) \geq \gamma^*\}}$. Since, by assumption, $\{f(\cdot; \mathbf{v})\}$ contains $f(\cdot; {}_*\mathbf{v})$, it is
also immediate that $\mathcal{D}(\delta_{\mathbf{x}^*}, f(\cdot; \mathbf{v}^*)) = 0$ for $\mathbf{v}^* = {}_*\mathbf{v}$; in other words the CE
and VM solutions coincide. □

Proposition 4.2 demonstrates the importance of finite support distributions
when $\gamma^*$ is the maximum value of $S$: the solutions of both VM and CE
programs to estimate $P(S(\mathbf{X}) \geq \gamma^*)$ are often the same, regardless of the
distribution of $\mathbf{X}$. In particular this holds when the degenerate density given
in (4.4) belongs to the family $\{f(\cdot; \mathbf{v})\}$ of the ASP, that is, when there exists
some $\mathbf{v}^* \in \mathcal{V}$ such that (4.5) holds. This is typically the case for finite support
distributions. This has important implications for combinatorial optimization.
In particular, by associating the underlying combinatorial optimization problem
(optimizing $S$) with a rare-event probability of the type $P(S(\mathbf{X}) \geq \gamma^*)$
we can obtain an estimate of the “degenerated vector” $\mathbf{v}^*$ via a CE algorithm
similar to Algorithm 3.4.1. This in turn gives an estimate of the true optimal
solution of the COP.

It is worth mentioning that the assumption of uniqueness of the maximizer
of $S$ in Proposition 4.2 can be artificially enforced by imposing some ordering
on the finite set $\mathcal{X}$, say the lexicographical order. Then, we can define a
function $Z$ on $\mathcal{X}$ as $Z(\mathbf{x}) = S(\mathbf{x}) - \varepsilon(\mathbf{x})$, where $\varepsilon(\mathbf{x})$ is a small perturbation,
proportional to the rank of $\mathbf{x}$ and small enough to ensure that $Z(\mathbf{x}_1) > Z(\mathbf{x}_2)$
if $S(\mathbf{x}_1) > S(\mathbf{x}_2)$. In that case, the degenerate measure has mass at the element
with highest rank within the set of maximizers of $S$.

Having defined the ASP, we wish to generate, analogously to Chapter 3, a
sequence of tuples $\{(\hat{\gamma}_t, \hat{\mathbf{v}}_t)\}$, which converges quickly to a small neighborhood
of the optimal tuple $(\gamma^*, \mathbf{v}^*)$. More specifically, we initialize by setting $\mathbf{v}_0 = \mathbf{u}$,
choosing a not very small $\varrho$, say $\varrho = 10^{-2}$, and based on the fundamental CE
formulas (3.37)-(3.41) we proceed as follows:

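1. Adaptive updating of $\gamma_t$. For a fixed $\mathbf{v}_{t-1}$, let $\gamma_t$ be the $(1 - \varrho)$-quantile
of $S(\mathbf{X})$ under $\mathbf{v}_{t-1}$. That is, $\gamma_t$ satisfies (3.37) and (3.38).

A simple estimator $\hat{\gamma}_t$ of $\gamma_t$ can be obtained by drawing a random sample
$\mathbf{X}_1, \ldots, \mathbf{X}_N$ from $f(\cdot; \mathbf{v}_{t-1})$ and evaluating the sample $(1 - \varrho)$-quantile of
the performances

$$\hat{\gamma}_t = S_{(\lceil (1 - \varrho) N \rceil)}, \tag{4.6}$$

as in (3.39).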

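2. Adaptive updating of $\mathbf{v}_t$. For fixed $\gamma_t$ and $\mathbf{v}_{t-1}$, derive $\mathbf{v}_t$ from the
solution of the program

$$\max_{\mathbf{v}} D(\mathbf{v}) = \max_{\mathbf{v}} E_{\mathbf{v}_{t-1}} I_{\{S(\mathbf{X}) \geq \gamma_t\}} \ln f(\mathbf{X}; \mathbf{v}). \tag{4.7}$$

The stochastic counterpart of (4.7) is as follows: for fixed $\hat{\gamma}_t$ and $\hat{\mathbf{v}}_{t-1}$,
derive $\tilde{\mathbf{v}}_t$ from the following program

$$\max_{\mathbf{v}} \hat{D}(\mathbf{v}) = \max_{\mathbf{v}} \frac{1}{N} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \geq \hat{\gamma}_t\}} \ln f(\mathbf{X}_i; \mathbf{v}). \tag{4.8}$$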







It is important to observe that in contrast to the formulas (3.40) and
(3.41) (for the rare-event setting), formulas (4.7) and (4.8) do not contain
the likelihood ratio terms $W$. The reason is that in the rare-event setting the
initial (nominal) parameter $\mathbf{u}$ is specified in advance and is an essential part of
the estimation problem. In contrast, the initial reference vector $\mathbf{u}$ in the ASP
is quite arbitrary. In fact, it is beneficial to change the ASP as we go along.
In effect, by dropping the $W$ term, we efficiently estimate at each iteration $t$
the CE optimal reference parameter vector $\mathbf{v}_t$ for the rare-event probability
$P_{\mathbf{v}_t}(S(\mathbf{X}) \geq \gamma_t) \geq P_{\mathbf{v}_{t-1}}(S(\mathbf{X}) \geq \gamma_t)$. Of course, we could also include here
the $W$ terms, but numerical experiments suggest that this will often lead to
less reliable (noisy) estimates of $\mathbf{v}_t$ and $\mathbf{v}^*$.

Remark 4.3 (Smoothed Updating). Instead of updating the parameter vector
$\mathbf{v}$ directly via the solution of (4.8) we use the following smoothed version

$$\hat{\mathbf{v}}_t = \alpha \tilde{\mathbf{v}}_t + (1 - \alpha) \hat{\mathbf{v}}_{t-1}, \tag{4.9}$$

where $\tilde{\mathbf{v}}_t$ is the parameter vector obtained from the solution of (4.8), and $\alpha$ is
called the smoothing parameter, with $0.7 < \alpha \leq 1$. Clearly, for $\alpha = 1$ we have
our original updating rule. The reason for using the smoothed (4.9) instead
of the original updating rule is twofold: (a) to smooth out the values of $\hat{\mathbf{v}}_t$,
(b) to reduce the probability that some component $\hat{v}_{t,i}$ of $\hat{\mathbf{v}}_t$ will be zero or
one at the first few iterations. This is particularly important when $\hat{\mathbf{v}}_t$ is a
vector or matrix of probabilities. Note that for $0 < \alpha < 1$ we always have that
$\hat{v}_{t,i} > 0$, while for $\alpha = 1$ one might have (even at the first iterations) that
either $\hat{v}_{t,i} = 0$ or $\hat{v}_{t,i} = 1$ for some indices $i$. As a result, the algorithm can
converge to a wrong solution.
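The effect of the smoothed update (4.9) can be seen in a few lines of Python. This is only an illustrative sketch: the function name `smooth_update` and the numerical values are hypothetical, chosen so that the raw solution of (4.8) has two degenerate components.

```python
def smooth_update(v_new, v_prev, alpha=0.7):
    """Smoothed update (4.9): alpha * v_new + (1 - alpha) * v_prev, componentwise."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(v_new, v_prev)]

v_prev = [0.5, 0.5, 0.5]       # previous reference vector
v_tilde = [1.0, 0.0, 0.6]      # raw solution of (4.8); two components are degenerate
v_hat = smooth_update(v_tilde, v_prev, alpha=0.7)
v_raw = smooth_update(v_tilde, v_prev, alpha=1.0)   # alpha = 1: original rule
```

With $\alpha = 0.7$ the degenerate components are pulled back into the open interval $(0, 1)$, so the corresponding values can still be sampled in the next iteration; with $\alpha = 1$ they stay at 0 and 1.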

Thus, the main CE optimization algorithm, which includes smoothed
updating of the parameter vector $\mathbf{v}$, can be summarized as follows.

Algorithm 4.2.1 (Main CE Algorithm for Optimization)

1. Choose some $\hat{\mathbf{v}}_0$. Set $t = 1$ (level counter).

2. Generate a sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ from the density $f(\cdot; \hat{\mathbf{v}}_{t-1})$ and compute
the sample $(1 - \varrho)$-quantile $\hat{\gamma}_t$ of the performances according to (4.6).

3. Use the same sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ and solve the stochastic program (4.8).
Denote the solution by $\tilde{\mathbf{v}}_t$.

4. Apply (4.9) to smooth out the vector $\tilde{\mathbf{v}}_t$.

5. If for some $t \geq d$, say $d = 5$,

$$\hat{\gamma}_t = \hat{\gamma}_{t-1} = \cdots = \hat{\gamma}_{t-d}, \tag{4.10}$$

then stop (let $T$ denote the final iteration); otherwise set $t = t + 1$ and
reiterate from Step 2.
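The five steps of Algorithm 4.2.1 can be sketched in Python for the simplest case of binary vectors with independent Bernoulli components. This is a toy illustration under assumptions, not the implementation used for the numerical results of this chapter: the objective (number of positions matching a hidden vector), the function name `ce_maximize`, and the settings $N = 500$, $\varrho = 0.1$ are all hypothetical choices made to keep the example small.

```python
import math
import random

def ce_maximize(score, n, N=500, rho=0.1, alpha=0.7, d=5, max_iter=200, seed=0):
    """Sketch of Algorithm 4.2.1 for x in {0,1}^n with Bernoulli components."""
    rng = random.Random(seed)
    p = [0.5] * n                                   # Step 1: initial vector
    gammas = []
    for _ in range(max_iter):
        # Step 2: sample from f(.; v_{t-1}), take the sample (1 - rho)-quantile (4.6)
        xs = [[1 if rng.random() < pi else 0 for pi in p] for _ in range(N)]
        scores = [score(x) for x in xs]
        gamma = sorted(scores)[math.ceil((1 - rho) * N) - 1]
        # Step 3: solve (4.8); for Bernoulli components this is the elite-sample mean
        elite = [x for x, s in zip(xs, scores) if s >= gamma]
        p_tilde = [sum(x[i] for x in elite) / len(elite) for i in range(n)]
        # Step 4: smoothed update (4.9)
        p = [alpha * a + (1 - alpha) * b for a, b in zip(p_tilde, p)]
        gammas.append(gamma)
        # Step 5: stop once the last d + 1 levels coincide, as in (4.10)
        if len(gammas) > d and len(set(gammas[-(d + 1):])) == 1:
            break
    return p, gammas

hidden = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]             # hypothetical target vector
score = lambda x: sum(a == b for a, b in zip(x, hidden))
p_final, gammas = ce_maximize(score, n=len(hidden))
best = [round(v) for v in p_final]                  # rounded reference vector
```

For this separable toy objective the levels usually climb to the maximum within a handful of iterations, after which the reference vector approaches the degenerate vector concentrated on the hidden target.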







Remark 4.4 (Parameter settings). In our numerical studies with stochastic
optimization problems we typically take $\varrho = 10^{-2}$, and the smoothing
parameter $\alpha = 0.7$. We choose $N = CR$, where $C$ is a constant, say $C = 5$, and
$R$ is the number of auxiliary distributional parameters introduced into the
ASP and which need to be eventually estimated in the sense of degenerated
values as per (4.4). For example, in the max-cut problem we take $R = n$
since, as we shall see below, we associate with each vertex an independent
$\mathrm{Ber}(p_k)$, $k = 1, \ldots, n$, random variable. Similarly, in the TSP we take $R = n^2$
since here we associate an $n \times n$ probability matrix with the given $n \times n$ cost
or distance matrix between the cities. Note that we assume that $n > 100$.
Depending on whether we are dealing with a maximization or minimization
problem the parameter $\eta = \varrho N$ corresponds to the number of sample
performances $S(\mathbf{X}_j)$, $j = 1, \ldots, N$, that lie either in the upper or lower $100\varrho$ percentage.
We will refer to the latter as elite samples. The choice of $\varrho = 10^{-2}$ allows
Algorithm 4.5.2 to avoid local extrema and to settle down in the global one with
high probability.

The suggestion to take $\alpha = 0.7$, $C = 5$ and $\varrho = 10^{-2}$ should be viewed
merely as a rule of thumb. For some problems this combination of parameters
will lead to the global optimum, but for other problems $(\alpha, C, \varrho)$ might need
to be set for example to (0.7,5, 10“^) or (0.3,8, 10“^). A useful technique to 
find the “right” parameters is to run a number of small test problems that 
are derived from the original problem. For more details see Remark 4.13. 

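In analogy to Algorithm 3.4.2 we also present the deterministic version of
Algorithm 4.2.1, which will be used below.

Algorithm 4.2.2 (Deterministic Version)

1. Choose some $\mathbf{v}_0$. Set $t = 1$.

2. Calculate $\gamma_t$ as

$$\gamma_t = \max \{ s : P_{\mathbf{v}_{t-1}}(S(\mathbf{X}) \geq s) \geq \varrho \}. \tag{4.11}$$

3. Calculate $\mathbf{v}_t$ as

$$\mathbf{v}_t = \operatorname*{argmax}_{\mathbf{v}} E_{\mathbf{v}_{t-1}} I_{\{S(\mathbf{X}) \geq \gamma_t\}} \ln f(\mathbf{X}; \mathbf{v}). \tag{4.12}$$

4. If for some $t \geq d$, say $d = 5$,

$$\gamma_t = \gamma_{t-1} = \cdots = \gamma_{t-d}, \tag{4.13}$$

then stop (let $T$ denote the final iteration); otherwise set $t = t + 1$ and
reiterate from Step 2.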
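For very small problems the deterministic version can be carried out exactly by enumerating all states. The Python sketch below does this for binary vectors with independent Bernoulli components; the function name `ce_deterministic`, the toy objective (number of positions matching a hidden vector) and the parameter values are hypothetical, and the conditional-mean form of Step 3 is specific to this Bernoulli family.

```python
import itertools

def ce_deterministic(score, n, rho=0.1, d=5, max_iter=100):
    """Sketch of Algorithm 4.2.2 on {0,1}^n, using exact enumeration."""
    states = list(itertools.product((0, 1), repeat=n))
    scores = [score(x) for x in states]
    p = [0.5] * n
    gammas = []
    for _ in range(max_iter):
        # Exact probability of every state under the current Bernoulli vector p
        probs = []
        for x in states:
            w = 1.0
            for xi, pi in zip(x, p):
                w *= pi if xi else 1.0 - pi
            probs.append(w)
        # (4.11): largest level s whose tail probability is at least rho
        gamma = max(s for s in set(scores)
                    if sum(w for w, sc in zip(probs, scores) if sc >= s) >= rho)
        # (4.12): for Bernoulli components the maximizer is a conditional mean
        mass = sum(w for w, sc in zip(probs, scores) if sc >= gamma)
        p = [sum(w * x[i] for w, x, sc in zip(probs, states, scores) if sc >= gamma) / mass
             for i in range(n)]
        gammas.append(gamma)
        # (4.13): stop once the last d + 1 levels coincide
        if len(gammas) > d and len(set(gammas[-(d + 1):])) == 1:
            break
    return p, gammas

hidden = (1, 0, 1, 1)                               # hypothetical target vector
p_det, gammas_det = ce_deterministic(
    lambda x: sum(a == b for a, b in zip(x, hidden)), n=len(hidden))
```

In this example the level first reaches 3, then the maximum 4, at which point the reference vector becomes the degenerate vector at the maximizer and no longer changes.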

The main advantage of the CE method versus variance minimization is
that in most combinatorial optimization problems the updating of $\mathbf{v}_t$ (Step
3) can be done analytically, that is, there is no need for numerical
optimization. Note also that in general there are many ways to randomize the
objective function $S(\mathbf{x})$ and thus to generate samples from $\mathcal{X}$, while estimating
the rare-event probability $\ell(\gamma)$ of the ASP. Note finally that it is not always
immediately clear which way of generating the sample will yield better results
or easier updating formulas; this issue is more of an art than a science and will
be discussed in some detail for various COPs in this and following chapters.
We proceed next with adaptive updating of $\mathbf{v}_t$ for SNN-type and SEN-type
optimization problems.



4.3 Adaptive Parameter Updating in Stochastic Node 
Networks 

SNN-type problems can be formulated as follows. Consider a complete graph
with $n$ nodes, labeled $1, 2, \ldots, n$. Let $\mathbf{x} = (x_1, \ldots, x_n)$ be a vector which
assigns some characteristic $x_i$ to each node $i$. The vector $\mathbf{x}$ can be thought
of as a “coloring” of the nodes. A typical example is the max-cut problem;
see Section 2.5.1 and Section 4.5, where we have only two colors, $x_i \in \{0, 1\}$,
which determine how the nodes are partitioned into two sets. In Figure 4.1 a
node coloring is illustrated for the case $n = 5$ nodes and $m = 3$ colors, thus
$x_i \in \{1, 2, 3\}$.







Fig. 4.1. Stochastic Node Network. 



We assume that to each vector $\mathbf{x}$ is associated a cost $S(\mathbf{x})$, which we wish
to optimize, say maximize, over all $\mathbf{x} \in \mathcal{X}$ by using Algorithm 4.2.1. Let
us assume for simplicity that

$$\mathcal{X} = \{1, \ldots, m\}^n.$$

The simplest mechanism for “randomizing” our deterministic problem into
an ASP (4.3) is to draw the components of $\mathbf{X} = (X_1, \ldots, X_n)$ independently
according to a discrete distribution $\mathbf{p} = \{p_{ij}\}$, where $p_{ij} = P(X_i = j)$,
$i = 1, \ldots, n$, $j = 1, \ldots, m$.

It follows directly from Example 3.6 that the analytical solution of the
program (4.8) for the parameters $p_{ij}$ is given by

$$\hat{p}_{t,ij} = \frac{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \geq \hat{\gamma}_t\}} I_{\{X_{ki} = j\}}}{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \geq \hat{\gamma}_t\}}}. \tag{4.14}$$

Similarly, the deterministic updating formulas are

$$p_{t,ij} = \frac{E_{\mathbf{p}_{t-1}} I_{\{S(\mathbf{X}) \geq \gamma_t\}} I_{\{X_i = j\}}}{E_{\mathbf{p}_{t-1}} I_{\{S(\mathbf{X}) \geq \gamma_t\}}}, \tag{4.15}$$

where $\mathbf{p}_{t-1} = \{p_{t-1,ij}\}$ denotes the reference parameter (probability
distribution) at the $(t-1)$-st iteration.



4.3.1 Conditional Sampling 



In many SNN-type problems we wish to optimize some function $S$ not over the
whole set $\mathcal{X} = \{1, \ldots, m\}^n$ but rather over some subset $\mathcal{Y}$ of $\mathcal{X}$. For example,
we may wish to consider only colorings of the graph in Figure 4.1 that use
all three colors. Or, for the max-cut problem we may wish to consider only
equal-sized node partitions. To accommodate these types of problems, we now
formulate an important generalization of the sample generation and updating
formulas above. Suppose we wish to maximize some function $S$ over a set

$$\mathcal{Y} \subset \{1, \ldots, m\}^n.$$

Define a new function $\tilde{S}(\mathbf{x}) = S(\mathbf{x})$ for $\mathbf{x} \in \mathcal{Y}$ and $\tilde{S}(\mathbf{x}) = -\infty$ otherwise. Clearly,
maximization of $S$ over $\mathcal{Y}$ is equivalent to maximization of $\tilde{S}$ over $\mathcal{X}$. For
the latter problem we can use the CE approach discussed above. That is, we
draw the components of $\mathbf{X}$ independently according to a discrete distribution
$\mathbf{p}$ and update the parameters $p_{ij}$ according to (4.15), with $S$ replaced by $\tilde{S}$.
By conditioning on the event $\{\mathbf{X} \in \mathcal{Y}\}$, these updating rules can be written
as

$$p_{t,ij} = \frac{E_{\mathbf{p}_{t-1}}\left[ I_{\{S(\mathbf{X}) \geq \gamma_t\}} I_{\{X_i = j\}} \mid \mathbf{X} \in \mathcal{Y} \right]}{E_{\mathbf{p}_{t-1}}\left[ I_{\{S(\mathbf{X}) \geq \gamma_t\}} \mid \mathbf{X} \in \mathcal{Y} \right]}, \tag{4.16}$$

where we have used the fact that $\tilde{S}(\mathbf{x}) = S(\mathbf{x})$ on $\mathcal{Y}$. In other words, the
updating rule for this more general SNN problem is very similar to the original
updating rule (4.15). The only difference is that we need to draw a sample
from the conditional distribution of $\mathbf{X} \sim \mathbf{p}$ given $\mathbf{X} \in \mathcal{Y}$, instead of from
the distribution $\mathbf{p}$. This can be done, for example, by using the
acceptance-rejection method, or by more direct methods; see for example Sections 4.5
and 4.6.
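Acceptance-rejection is the most direct way to sample from the conditional distribution: draw from $\mathbf{p}$ and keep the draw only if it lies in the subset. The Python sketch below applies this to colorings that use all three colors; the function name `sample_conditional`, the retry limit, and the uniform starting distribution are hypothetical choices for illustration.

```python
import random

def sample_conditional(p, in_subset, rng, max_tries=100000):
    """Draw X ~ p (independent components, colors 1..m) given in_subset(X)."""
    colors = list(range(1, len(p[0]) + 1))
    for _ in range(max_tries):
        x = tuple(rng.choices(colors, weights=row)[0] for row in p)
        if in_subset(x):                  # accept only points of the subset
            return x
    raise RuntimeError("acceptance probability appears to be too small")

rng = random.Random(1)
p = [[1 / 3] * 3 for _ in range(5)]       # n = 5 nodes, m = 3 colors, uniform
uses_all_colors = lambda x: set(x) == {1, 2, 3}
sample = [sample_conditional(p, uses_all_colors, rng) for _ in range(50)]
```

The approach is only practical when the subset has non-negligible probability under $\mathbf{p}$; for very restrictive subsets a direct construction of feasible samples is usually preferable.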





4.3.2 Degenerate Distribution

Assume that the optimal solution $\gamma^*$ to (4.2) for a certain SNN problem
is unique, with corresponding argument $\mathbf{x}^*$. Then, using reasoning similar to
Proposition 4.2 and (4.4), we can see that the CE and VM optimal distribution
$\mathbf{p}$ for the ASP (4.3) is the optimal degenerate distribution $\mathbf{p}^*$ with

$$p^*_{ij} = \begin{cases} 1 & \text{if } x^*_i = j, \\ 0 & \text{otherwise.} \end{cases} \tag{4.17}$$

As an example, consider the SNN in Figure 4.1, and suppose that the
optimal vector $\mathbf{x}^*$ is given as shown, that is $\mathbf{x}^* = (2, 2, 1, 3, 3)$. Then we have
$p^*_{12} = p^*_{22} = p^*_{31} = p^*_{43} = p^*_{53} = 1$ and all other entries are 0.

As we already mentioned, we shall approximate the unknown true solution
$(\gamma^*, \mathbf{p}^*)$ by the sequence of tuples $\{(\hat{\gamma}_t, \hat{\mathbf{p}}_t)\}$ generated by Algorithm 4.2.1.


4.4 Adaptive Parameter Updating in Stochastic Edge 
Networks 

SEN-type problems can be formulated in a similar fashion to SNN-type
problems. The main difference is that instead of assigning characteristics (colors,
values) to nodes, we assign them to edges. For example, consider the
first graph of Figure 4.2. Here we assign to each edge $(i, j)$ a value $x_{ij}$ which is
either 1 (bold edge) or 0. Let $\mathbf{x} = \{x_{ij}\}$. Suppose each edge $(i, j)$ has a weight
or cost $c_{ij}$. A possible SEN optimization problem could be to find the maximum
spanning tree, that is to maximize $S(\mathbf{x}) = \sum_{(i,j)} x_{ij} c_{ij}$ over the set $\mathcal{X}$ of
all spanning trees. We can generate random vectors $\mathbf{X} \in \mathcal{X}$ via (conditional)
independent sampling from a Bernoulli distribution in exactly the same way
as described for SNN networks.







Fig. 4.2. Stochastic Edge Networks. 






4.4.1 Parameter Updating for Markov Chains 

A different SEN-type problem is illustrated in the second graph of Figure 4.2.
Here we wish to construct a random path $\mathbf{X}$ through a graph with $n$ nodes,
labeled $1, 2, \ldots, n$. This random path can be represented in several ways. For
example, the path through the sequence of nodes $x_1 \to x_2 \to \cdots \to x_m$ can
be represented as the vector $\mathbf{x} = (x_1, \ldots, x_m)$. The second graph of Figure 4.2
depicts the $m = 4$-dimensional vector $\mathbf{x} = (1, 2, 4, 5)$ that represents the path
$1 \to 2 \to 4 \to 5$ through a graph with $n = 5$ nodes. From now on we identify
the path with its corresponding path vector.

A natural way to generate the path $\mathbf{X}$ is to draw $X_1, \ldots, X_m$ according
to a Markov chain. Thus, in this case our random mechanism of generating
samples from $\mathcal{X} = \{1, \ldots, n\}^m$ is determined by a one-step transition matrix
$P = (p_{ij})$, such that

$$P(X_{k+1} = j \mid X_k = i) = p_{ij}, \quad k = 1, \ldots, m - 1.$$

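Suppose for simplicity that the initial distribution of the Markov chain is
fixed. For example, the Markov chain always starts at 1. The logarithm of the
density of $\mathbf{X}$ is given by

$$\ln f(\mathbf{x}; P) = \sum_{r=1}^{m} \sum_{i,j} I_{\{\mathbf{x} \in \mathcal{X}_{ij}(r)\}} \ln p_{ij},$$

where $\mathcal{X}_{ij}(r)$ is the set of all paths in $\mathcal{X}$ for which the $r$-th transition is from
node $i$ to $j$. The deterministic CE updating rules for this mechanism follow
from (4.7), with $P$ taking the role of the reference vector $\mathbf{v}$, with the extra
condition that the rows of $P$ need to sum up to 1. Using Lagrange multipliers
we obtain analogously to (2.44)-(2.46) the updating formula

$$p_{t,ij} = \frac{E_{P_{t-1}} I_{\{S(\mathbf{X}) \geq \gamma_t\}} \sum_{r=1}^{m} I_{\{\mathbf{X} \in \mathcal{X}_{ij}(r)\}}}{E_{P_{t-1}} I_{\{S(\mathbf{X}) \geq \gamma_t\}} \sum_{r=1}^{m} I_{\{\mathbf{X} \in \mathcal{X}_{i}(r)\}}}, \tag{4.18}$$

where $\mathcal{X}_i(r)$ is the set of paths for which the $r$-th transition starts from node
$i$. The corresponding estimator is

$$\hat{p}_{t,ij} = \frac{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \geq \hat{\gamma}_t\}} \sum_{r=1}^{m} I_{\{\mathbf{X}_k \in \mathcal{X}_{ij}(r)\}}}{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \geq \hat{\gamma}_t\}} \sum_{r=1}^{m} I_{\{\mathbf{X}_k \in \mathcal{X}_{i}(r)\}}}. \tag{4.19}$$

This has a very simple interpretation. To update $p_{ij}$ we simply take the
fraction of times that the transitions from $i$ to $j$ occurred, taking only those paths
into account that have a performance greater than or equal to $\hat{\gamma}_t$.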
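The counting interpretation of the transition-matrix update translates directly into code. The Python sketch below tallies transitions over the elite paths and normalizes each row; the function name `update_transition_matrix`, the three example paths and their scores are hypothetical, and returning `None` for a row with no observed transitions is a choice made for this illustration (in practice one might keep the previous row instead).

```python
def update_transition_matrix(paths, scores, gamma, n):
    """Estimator (4.19): elite transition counts i -> j, normalized per row."""
    counts = [[0] * n for _ in range(n)]
    for path, s in zip(paths, scores):
        if s >= gamma:                            # use elite paths only
            for i, j in zip(path, path[1:]):      # consecutive transitions
                counts[i - 1][j - 1] += 1
    P = []
    for row in counts:
        total = sum(row)
        # A row without observed transitions stays unspecified (None here).
        P.append([c / total for c in row] if total else None)
    return P

# Hypothetical paths through a graph with n = 5 nodes, with their scores.
paths = [(1, 2, 4, 5), (1, 3, 4, 5), (1, 2, 3, 5)]
scores = [10, 8, 3]
P_hat = update_transition_matrix(paths, scores, gamma=8, n=5)
```

Only the first two paths are elite here, so node 1 moves to nodes 2 and 3 with estimated probability 1/2 each, while node 5 is never left and its row remains unspecified.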





4.4.2 Conditional Sampling 

It is important to realize that the derivation of the updating rules above
assumes that the paths are generated via a Markov process and that $S$ is
optimized over the whole set $\mathcal{X}$. However, similar to the discussion of (4.16)
for SNNs, many SEN-type problems involve optimization of $S$ over a subset $\mathcal{Y}$
of $\mathcal{X}$. For example, in the second graph of Figure 4.2 one may wish to consider
only nonintersecting paths; in particular, in the TSP only nonintersecting
paths that visit all nodes and start and finish at the same node are allowed.

Fortunately, to tackle this kind of problem we may reason in exactly
the same way as for (4.16). That is, we obtain for these problems the same
updating formula (4.19), provided we sample from the conditional distribution
of the original Markov chain given the event $\{\mathbf{X} \in \mathcal{Y}\}$. Note that for the TSP
the sum $\sum_{r=1}^{m} I_{\{\mathbf{X}_k \in \mathcal{X}_i(r)\}}$ in the denominator of (4.19) is equal to 1. As
remarked earlier, sampling from the conditional distribution can be done via
the acceptance-rejection method or via more direct techniques; examples are
given in Section 4.7.



4.4.3 Optimal Degenerate Transition Matrix 

Suppose that the optimal solution $\gamma^*$ to (4.2) for a SEN problem is unique,
with a corresponding path $\mathbf{x}^*$. Similar to (4.17) we can identify $\mathbf{x}^*$ with a
“degenerate” transition matrix $P^* = (p^*_{ij})$. In particular, if the path $\mathbf{x}^*$ does
not traverse the same link more than once, then for all $i$ we have $p^*_{ij} = 1$ if
$\mathbf{x}^*$ contains the link $i \to j$; and in that case $p^*_{ik} = 0$ for all $k \neq j$. For certain
SENs, such as the TSP, the matrix $P^*$ is truly “degenerate” (contains only
zeros and ones), but for other problems such as the longest path problem
some rows may remain unspecified.



4.5 The Max-Cut Problem 

The maximal cut (max-cut) problem in a graph can be formulated as follows.
Given a graph $G(V, E)$ with set of nodes $V = \{1, \ldots, n\}$ and set of edges $E$
between the nodes, partition the nodes of the graph into two arbitrary subsets
$V_1$ and $V_2$ such that the sum of the weights (costs) $c_{ij}$ of the edges going from
one subset to the other is maximized. Note that some of the $c_{ij}$ may be 0,
indicating that there is, in fact, no edge from $i$ to $j$. For simplicity we assume
the graph is not directed; see also Remark 4.5. In that case $C = (c_{ij})$ is a
symmetric matrix.

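As an example, consider the graph in Figure 4.3, with the corresponding cost matrix

$$C = \begin{pmatrix} 0 & c_{12} & c_{13} & 0 & 0 & 0 \\ c_{21} & 0 & c_{23} & c_{24} & 0 & 0 \\ c_{31} & c_{32} & 0 & c_{34} & c_{35} & 0 \\ 0 & c_{42} & c_{43} & 0 & c_{45} & c_{46} \\ 0 & 0 & c_{53} & c_{54} & 0 & c_{56} \\ 0 & 0 & 0 & c_{64} & c_{65} & 0 \end{pmatrix}, \tag{4.20}$$

where $c_{ij} = c_{ji}$. Here the cut $\{\{1, 3, 4\}, \{2, 5, 6\}\}$ has cost

$$c_{12} + c_{32} + c_{42} + c_{45} + c_{46} + c_{53}.$$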

As explained in Section 2.5.1, a cut can be conveniently represented by
its corresponding cut vector $\mathbf{x} = (x_1, \ldots, x_n)$, where $x_i = 1$ if node $i$ belongs
to the same partition as node 1, and 0 otherwise. For example the cut in Figure 4.3
can be represented via the cut vector $(1, 0, 1, 1, 0, 0)$. For each cut vector $\mathbf{x}$, let
$\{V_1(\mathbf{x}), V_2(\mathbf{x})\}$ be the partition of $V$ induced by $\mathbf{x}$, such that $V_1(\mathbf{x})$ contains
the set of indices $\{i : x_i = 1\}$. If not stated otherwise we set $x_1 = 1$, that is, $1 \in V_1$.

Let $\mathcal{X}$ be the set of all cut vectors $\mathbf{x} = (1, x_2, \ldots, x_n)$ and let $S(\mathbf{x})$ be the
corresponding cost of the cut. Then,

$$S(\mathbf{x}) = \sum_{i \in V_1(\mathbf{x}),\, j \in V_2(\mathbf{x})} c_{ij}. \tag{4.21}$$

It is readily seen that the total number of cuts is

$$|\mathcal{X}| = 2^{n-1} - 1. \tag{4.22}$$

Remark 4.5 (Directed graphs). It is important to note that we always assume
the graph to be undirected. For completeness we mention that for a directed
graph the cost of a cut $\{V_1, V_2\}$ includes both the cost of the edges from $V_1$
to $V_2$ and from $V_2$ to $V_1$. In this case the cost corresponding to a cut vector $x$
is therefore

    S(x) = \sum_{i \in V_1(x),\; j \in V_2(x)} (c_{ij} + c_{ji}) .         (4.23)







We proceed next with random cut generation and update the corresponding
sequence of parameters $\{(\hat{\gamma}_t, \hat{p}_t)\}$ using the CE Algorithm 4.2.1.

Since the max-cut problem is a typical SNN-type optimization problem, the
most natural and easiest way to generate the cut vectors is to let $X_2, \ldots, X_n$ be
independent Bernoulli random variables with success probabilities $p_2, \ldots, p_n$.

Algorithm 4.5.1 (Random Cut Generation)

1. Generate an $n$-dimensional random vector $X = (X_1, \ldots, X_n)$ from Ber($p$)
with independent components, where $p = (1, p_2, \ldots, p_n)$.

2. Construct the partition $\{V_1(X), V_2(X)\}$ of $V$ and calculate the
performance $S(X)$ as in (4.21).

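It immediately follows from (4.14) that the updating formula in
Algorithm 4.2.1 at the $t$-th iteration is given by

    \hat{p}_{t,i} = \frac{\sum_{k=1}^{N} I_{\{S(X_k) \ge \hat{\gamma}_t\}} \, I_{\{X_{ki} = 1\}}}
                         {\sum_{k=1}^{N} I_{\{S(X_k) \ge \hat{\gamma}_t\}}} ,
                    \quad i = 2, \ldots, n.                                 (4.24)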
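As a rough illustration, Algorithm 4.5.1 and the update (4.24) might be sketched in Python as follows. The function names `generate_cut`, `cut_cost`, and `update_p` are hypothetical, nodes are numbered from 0 rather than from 1, the graph is assumed undirected, and the level `gamma` is assumed to be attained by at least one sample. The cost matrix is the one of the 5-node example in (4.27).

```python
import random

# Cost matrix (4.27) of the 5-node example.
C = [[0, 1, 3, 5, 6],
     [1, 0, 3, 6, 5],
     [3, 3, 0, 2, 2],
     [5, 6, 2, 0, 2],
     [6, 5, 2, 2, 0]]

def generate_cut(p, rng):
    """Step 1 of Algorithm 4.5.1: independent Bernoulli components.
    With p[0] = 1 the first node always ends up in V_1."""
    return [1 if rng.random() < p_i else 0 for p_i in p]

def cut_cost(x, C):
    """S(x) as in (4.21): total weight of edges from V_1 (x_i = 1) to V_2 (x_j = 0)."""
    n = len(x)
    return sum(C[i][j] for i in range(n) for j in range(n)
               if x[i] == 1 and x[j] == 0)

def update_p(sample, scores, gamma):
    """Update (4.24): among samples with S(X_k) >= gamma,
    the fraction that have X_ki = 1."""
    elite = [x for x, s in zip(sample, scores) if s >= gamma]
    n = len(sample[0])
    return [sum(x[i] for x in elite) / len(elite) for i in range(n)]

rng = random.Random(0)
p = [1.0, 0.5, 0.5, 0.5, 0.5]
sample = [generate_cut(p, rng) for _ in range(20)]
scores = [cut_cost(x, C) for x in sample]
new_p = update_p(sample, scores, max(scores))   # elite = best sampled cuts only
```

Taking `gamma = max(scores)` here is only a convenient choice for the demonstration; in the algorithm itself the level is the sample quantile of the performances.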

The resulting algorithm for estimating the pair $(\gamma^*, p^*)$ can be presented
as follows.

Algorithm 4.5.2 (Main Algorithm for Max-Cut)

1. Choose an initial reference vector $\hat{p}_0$ with components $\hat{p}_{0,1} = 1$,
$\hat{p}_{0,i} = 1/2$, $i = 2, \ldots, n$. Set $t = 1$.

2. Generate a sample $X_1, \ldots, X_N$ of cut vectors via Algorithm 4.5.1, with
$p = \hat{p}_{t-1}$, and compute the sample $(1 - \varrho)$-quantile $\hat{\gamma}_t$ of the performances
according to (4.6).

3. Use the same sample $X_1, \ldots, X_N$ to update $\hat{p}_t$ via (4.24).

4. Apply (4.9) to smooth out the vector $\hat{p}_t$.

5. If for some $t > d$, say $d = 5$,

    \hat{\gamma}_t = \hat{\gamma}_{t-1} = \cdots = \hat{\gamma}_{t-d} ,    (4.25)

then stop (let $T$ denote the final iteration); otherwise set $t = t + 1$ and
reiterate from Step 2.

As an alternative to the estimate $\hat{\gamma}_T$ of $\gamma^*$ and to the stopping rule in
(4.25) one can consider the following:

5. (*) If for some $t = T > d$ and some $d$, say $d = 5$, (4.25) holds, stop and
deliver

    \tilde{\gamma}_T = \max_{0 \le s \le T} \hat{\gamma}_s                 (4.26)

as an estimate of $\gamma^*$. Otherwise, set $t = t + 1$ and go to Step 2.
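The five steps of Algorithm 4.5.2 can be sketched as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the function name `ce_max_cut` and its default arguments are hypothetical, the quantile is taken as the order statistic with index $\lceil (1-\varrho)N \rceil$, the smoothing step is assumed to be the convex combination of the new estimate and the previous vector with weight `alpha`, and the cost matrix is the one from (4.27).

```python
import math
import random

C = [[0, 1, 3, 5, 6],
     [1, 0, 3, 6, 5],
     [3, 3, 0, 2, 2],
     [5, 6, 2, 0, 2],
     [6, 5, 2, 2, 0]]   # cost matrix (4.27)

def cut_cost(x, C):
    n = len(x)
    return sum(C[i][j] for i in range(n) for j in range(n)
               if x[i] == 1 and x[j] == 0)

def ce_max_cut(C, N=50, rho=0.1, alpha=0.7, d=5, max_iter=200, seed=0):
    rng = random.Random(seed)
    n = len(C)
    p = [1.0] + [0.5] * (n - 1)                          # Step 1
    gammas = []
    for _ in range(max_iter):
        # Step 2: sample cut vectors and compute the sample (1 - rho)-quantile.
        sample = [[1 if rng.random() < q else 0 for q in p] for _ in range(N)]
        scores = [cut_cost(x, C) for x in sample]
        gamma = sorted(scores)[math.ceil((1 - rho) * N) - 1]
        # Step 3: update via (4.24) using the same sample.
        elite = [x for x, s in zip(sample, scores) if s >= gamma]
        w = [sum(x[i] for x in elite) / len(elite) for i in range(n)]
        # Step 4: smoothing (assumed form: alpha * new + (1 - alpha) * old).
        p = [alpha * wi + (1 - alpha) * pi for wi, pi in zip(w, p)]
        p[0] = 1.0                                       # node 1 stays in V_1
        gammas.append(gamma)
        # Step 5: stop when the level is unchanged over d + 1 iterations, see (4.25).
        if len(gammas) > d and len(set(gammas[-(d + 1):])) == 1:
            break
    return gammas, p

gammas, p_final = ce_max_cut(C)
best_level = max(gammas)    # alternative estimate in the spirit of (4.26)
```

The run is random, so the final level depends on the seed; for this instance no level can exceed the optimal value 28.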







Remark 4.6 (Parallel Computing). The CE method is ideally suited to parallel
computation. Assume that we have $r$ parallel processors and consider, for
example, the max-cut problem. We can speed up Algorithm 4.5.1 for random
cut generation and the associated calculation of the performance function $S$
by a factor of $r$ by generating on each processor $N/r$ cuts instead of generating
$N$ cuts on a single processor. Since the time required to calculate $(\hat{\gamma}_t, \hat{p}_t)$
is negligible compared to the time needed to generate the random cuts and
evaluate the performances, parallel computation will speed up the CE
Algorithm 4.5.2 by a factor of $r$, approximately. Similar speed-ups can be achieved
for other combinatorial optimization problems.

The following toy example illustrates the performance of the deterministic
version of Algorithm 4.5.2 step by step for a simple 5-node max-cut problem,
where the total number of cuts is $|\mathcal{X}| = 2^{5-1} - 1 = 15$; see (4.22). The small
size of the problem allows us to make all calculations analytically, that is, using
directly the deterministic Algorithm 4.2.2 (see (4.11), (4.12)) rather than its
stochastic counterpart.

Example 4.7 (Illustration of Algorithm 4.2.2). Consider the 5-node graph
presented in Figure 4.4. The 15 possible cuts and the corresponding cut values
are given in Table 4.1.

    C = \begin{pmatrix}
          0 & 1 & 3 & 5 & 6 \\
          1 & 0 & 3 & 6 & 5 \\
          3 & 3 & 0 & 2 & 2 \\
          5 & 6 & 2 & 0 & 2 \\
          6 & 5 & 2 & 2 & 0
        \end{pmatrix}                                                       (4.27)

Fig. 4.4. A 5-node network with cost matrix C.



Table 4.1. The 15 possible cuts of Example 4.7.

    x              V_1            V_2          S(x)
    (1,0,0,0,0)    {1}            {2,3,4,5}    15
    (1,1,0,0,0)    {1,2}          {3,4,5}      28
    (1,0,1,0,0)    {1,3}          {2,4,5}      19
    (1,0,0,1,0)    {1,4}          {2,3,5}      20
    (1,0,0,0,1)    {1,5}          {2,3,4}      18
    (1,1,1,0,0)    {1,2,3}        {4,5}        26
    (1,1,0,1,0)    {1,2,4}        {3,5}        21
    (1,1,0,0,1)    {1,2,5}        {3,4}        21
    (1,0,1,1,0)    {1,3,4}        {2,5}        20
    (1,0,1,0,1)    {1,3,5}        {2,4}        18
    (1,0,0,1,1)    {1,4,5}        {2,3}        19
    (1,1,1,1,0)    {1,2,3,4}      {5}          15
    (1,1,1,0,1)    {1,2,3,5}      {4}          15
    (1,1,0,1,1)    {1,2,4,5}      {3}          10
    (1,0,1,1,1)    {1,3,4,5}      {2}          15
    (1,1,1,1,1)    {1,2,3,4,5}    ∅             0







It follows that in this case the optimal cut vector is $x^* = (1, 1, 0, 0, 0)$ with
$S(x^*) = \gamma^* = 28$.
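The entries of Table 4.1 can be checked by brute force. The following sketch enumerates every cut vector with first component 1 for the cost matrix (4.27); the helper name `cut_cost` and the dictionary `table` are hypothetical.

```python
from itertools import product

C = [[0, 1, 3, 5, 6],
     [1, 0, 3, 6, 5],
     [3, 3, 0, 2, 2],
     [5, 6, 2, 0, 2],
     [6, 5, 2, 2, 0]]   # cost matrix (4.27)

def cut_cost(x, C):
    """S(x) as in (4.21) for a 0/1 cut vector x."""
    n = len(x)
    return sum(C[i][j] for i in range(n) for j in range(n)
               if x[i] == 1 and x[j] == 0)

# All vectors (1, x_2, ..., x_5), including the trivial one (1,1,1,1,1).
table = {(1,) + tail: 0 for tail in product((0, 1), repeat=4)}
for x in table:
    table[x] = cut_cost(x, C)

best = max(table, key=table.get)
print(best, table[best])
```

Only the vector (1,1,0,0,0) attains the value 28, in agreement with the table.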

We shall show next that in the deterministic Algorithm 4.2.2 adapted
to the max-cut problem the parameter vectors $p_0, p_1, \ldots$ converge to the
optimal $p^* = (1, 1, 0, 0, 0)$ after two iterations, when $\varrho = 10^{-1}$ and
$p_0 = (1, 1/2, 1/2, 1/2, 1/2)$.



Iteration 1 

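In the first step of the first iteration, we have to determine $\gamma_1$ from

    \gamma_t = \max \{ \gamma : E_{p_{t-1}} I_{\{S(X) \ge \gamma\}} \ge 0.1 \} .   (4.28)

It is readily seen that under the parameter vector $p_0$, $S(X)$ takes the values
$\{0, 10, 15, 18, 19, 20, 21, 26, 28\}$ with probabilities $\{1/16, 1/16, 4/16, 2/16,
2/16, 2/16, 2/16, 1/16, 1/16\}$. Since $P(S(X) \ge 28) = 1/16 < 0.1$ while
$P(S(X) \ge 26) = 2/16 \ge 0.1$, we find $\gamma_1 = 26$. In the second step, we
need to solve

    p_t = \operatorname{argmax}_{p} \; E_{p_{t-1}} I_{\{S(X) \ge \gamma_t\}} \ln f(X; p) ,   (4.29)

which has the solution

    p_{t,i} = \frac{E_{p_{t-1}} I_{\{S(X) \ge \gamma_t\}} X_i}{E_{p_{t-1}} I_{\{S(X) \ge \gamma_t\}}} .

There are only two vectors $x$ for which $S(x) \ge 26$, namely $(1, 1, 1, 0, 0)$ and
$(1, 1, 0, 0, 0)$, and both have probability 1/16 under $p_0$. Thus,

    p_{1,i} = \begin{cases}
                \dfrac{2/16}{2/16} = 1           & \text{for } i = 1, 2 , \\
                \dfrac{1/16}{2/16} = \dfrac{1}{2} & \text{for } i = 3 , \\
                \dfrac{0}{2/16} = 0              & \text{for } i = 4, 5 .
              \end{cases}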



Iteration 2 

In the second iteration, $S(X)$ is 26 or 28 with probability 1/2. Applying (4.28)
and (4.29) again yields the optimal $\gamma_2 = 28$ and the optimal $p_2 = (1, 1, 0, 0, 0)$,
respectively.
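The two iterations above can be reproduced exactly with rational arithmetic. The sketch below evaluates (4.28) and the solution of (4.29) by enumerating all cut vectors; the function names `pmf` and `deterministic_step` are hypothetical, and the enumeration is feasible only because the example is tiny.

```python
from fractions import Fraction
from itertools import product

C = [[0, 1, 3, 5, 6],
     [1, 0, 3, 6, 5],
     [3, 3, 0, 2, 2],
     [5, 6, 2, 0, 2],
     [6, 5, 2, 2, 0]]   # cost matrix (4.27)

def cut_cost(x):
    return sum(C[i][j] for i in range(5) for j in range(5)
               if x[i] == 1 and x[j] == 0)

def pmf(x, p):
    """Probability of x under independent Bernoulli components with parameter p."""
    prob = Fraction(1)
    for xi, pi in zip(x, p):
        prob *= pi if xi == 1 else 1 - pi
    return prob

def deterministic_step(p, rho):
    xs = [(1,) + tail for tail in product((0, 1), repeat=len(p) - 1)]
    probs = {x: pmf(x, p) for x in xs}

    def tail_mass(g):
        return sum(probs[x] for x in xs if cut_cost(x) >= g)

    # (4.28): largest attainable level whose tail probability is at least rho.
    levels = {cut_cost(x) for x in xs if probs[x] > 0}
    gamma = max(g for g in levels if tail_mass(g) >= rho)
    # Solution of (4.29): conditional mean of X_i given S(X) >= gamma.
    mass = tail_mass(gamma)
    new_p = tuple(
        sum(probs[x] * x[i] for x in xs if cut_cost(x) >= gamma) / mass
        for i in range(len(p)))
    return gamma, new_p

half = Fraction(1, 2)
rho = Fraction(1, 10)
gamma1, p1 = deterministic_step((Fraction(1), half, half, half, half), rho)
gamma2, p2 = deterministic_step(p1, rho)
```

Here `gamma1` and `p1` correspond to Iteration 1 and `gamma2` and `p2` to Iteration 2.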

Remark 4.8 (Alternative Stopping Rule). Note that the stopping rule (4.25),
which is based on convergence of the sequence $\{\hat{\gamma}_t\}$ to $\gamma^*$, stops Algorithm
4.5.2 when the sequence $\{\hat{\gamma}_t\}$ does not change. As an alternative stopping rule
we consider one that is based on convergence of the sequence $\{\hat{p}_t\}$, rather
than on convergence of $\{\hat{\gamma}_t\}$. According to that stopping rule, Algorithm 4.5.2
stops when the sequence $\{\hat{p}_t\}$ is very close to the optimal degenerate (i.e.,
zero-one) reference vector $p^*$, provided $p^*$ is a unique solution.

Formally, let $\hat{p}_{t,(1)} \le \cdots \le \hat{p}_{t,(n)}$ denote the ordered components of the
vector $\hat{p}_t = (\hat{p}_{t,1}, \ldots, \hat{p}_{t,n})$. Let $A_t = (i_1, \ldots, i_n)$ be the corresponding sequence
of indices. Thus,

    \hat{p}_{t,i_k} = \hat{p}_{t,(k)} , \quad k = 1, \ldots, n .

Let $\tau$ denote the number of unities in the optimal degenerate (binary)
reference vector. Denote by $\hat{\tau}_t$ the minimal index $i$ for which $\hat{p}_{t,(i)} > 1/2$.
Thus, by definition, $\hat{p}_{t,(i)} > 1/2$ for $i \ge \hat{\tau}_t$ and $\hat{p}_{t,(i)} \le 1/2$ for $i < \hat{\tau}_t$. It follows
from the above that for each $t$, the index $\hat{\tau}_t$ can be considered as an estimate
of the true $\tau$. For a given sequence $A_t$, let $A_t(k, \ldots, N)$ denote the subsequence
of $A_t$ formed by its elements in positions $k$ up to the last position (written $N$
here, with $N = n$). As our alternative stopping rule we
use the following: Stop if for some $t = T > d$ and some $d$, say $d = 5$,

    A_t(\hat{\tau}_t, \ldots, N) = A_{t-1}(\hat{\tau}_{t-1}, \ldots, N) = \cdots = A_{t-d}(\hat{\tau}_{t-d}, \ldots, N) , \quad t > d .   (4.30)

In other words, we stop if the indices of the components of the vector
$(\hat{p}_{t,(1)}, \ldots, \hat{p}_{t,(n)})$ that are greater than 1/2 remain the same for $d$
consecutive iterations.
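The verbal form of this rule is easy to code. The sketch below is an assumption-laden illustration: it compares the set of indices whose components exceed 1/2 (following the "in other words" formulation rather than the ordered sequences in (4.30)), and the names `indices_above_half` and `should_stop` are hypothetical.

```python
def indices_above_half(p):
    """Indices i (1-based, as in the text) with p_i > 1/2."""
    return frozenset(i + 1 for i, q in enumerate(p) if q > 0.5)

def should_stop(history, d=5):
    """True if the index set is identical for the last d + 1 vectors in history,
    i.e. for iterations t, t-1, ..., t-d."""
    if len(history) <= d:
        return False
    recent = [indices_above_half(p) for p in history[-(d + 1):]]
    return all(r == recent[0] for r in recent)

# Example: six reference vectors whose large components stay at nodes 1 and 2.
history = [[1.0, 0.6 + 0.05 * t, 0.4 - 0.05 * t, 0.1, 0.2] for t in range(6)]
stop = should_stop(history, d=5)
```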



The Max-Cut Problem with r Partitions 

We can readily extend the max-cut procedure to the case where the node set
$V$ is partitioned into $r > 2$ subsets $\{V_1, \ldots, V_r\}$ such that the sum of the edge
weights between all pairs of subsets $V_a$ and $V_b$, $a, b = 1, \ldots, r$ $(a < b)$, is
maximized. Thus for each partition $\{V_1, \ldots, V_r\}$ the value of the objective
function is

    \sum_{a=1}^{r} \sum_{b=a+1}^{r} \; \sum_{i \in V_a,\; j \in V_b} c_{ij} .

In this case one can follow the basic steps of Algorithm 4.2.1 using
independent $r$-point distributions instead of independent Bernoulli distributions
and update the probabilities exactly as in (4.14).



4.6 The Partition Problem 

The partition problem is similar to the max-cut problem. The only difference 
is that the size of each class is fixed in advance. This has implications for the 
trajectory generation. Consider, for example, a partition problem in which V 
has to be partitioned into two equal sets, assuming n is even. We cannot sim- 
ply use Algorithm 4.5.1 for the random cut generation, since most partitions 
generated in that way will have unequal size. 







We describe next a simple algorithm for the generation of a random
partition $\{V_1, V_2\}$ with exactly $m$ elements in $V_1$ and $n - m$ elements in $V_2$.
This is called a bipartition generation problem. Extension of the algorithm to
$r$-partition generation is simple.

For a given vector $p = (p_1, \ldots, p_n)$, with $p_1$ not fixed to 1 this time, we
draw the partition vector $X = (X_1, \ldots, X_n)$ according to $X \sim$ Ber($p$) with
independent components conditional upon the event that $X$ has exactly $m$
ones and $n - m$ zeros. To achieve this we introduce first an auxiliary mechanism
which draws a random permutation $\Pi = (\Pi_1, \ldots, \Pi_n)$ of $(1, \ldots, n)$ uniformly
over the space of all permutations. Drawing such a random permutation is
very easy: simply let $U_1, \ldots, U_n$ be i.i.d. U(0, 1)-distributed. Then, arrange
the $U_i$'s in nondecreasing order, that is, as $U_{\Pi_1} \le U_{\Pi_2} \le \cdots \le U_{\Pi_n}$, and take
finally the indices $\Pi = (\Pi_1, \ldots, \Pi_n)$ as the required random permutation.

We demonstrate our algorithm first for a 5-node network assuming $m = 2$
and $n - m = 3$. Similar to the max-cut problem we assume that the vector
$p = (p_1, \ldots, p_5)$ in Ber($p$) is given and then proceed as follows.

1. Generate a random permutation $\Pi = (\Pi_1, \ldots, \Pi_5)$ of $(1, \ldots, 5)$
uniformly over the space of all such permutations. Let $\pi = (\pi_1, \ldots, \pi_5)$
be a particular outcome. Suppose, for example, that the permutation is
$(\pi_1, \ldots, \pi_5) = (3, 5, 1, 2, 4)$.

2. Given $\Pi = (\pi_1, \ldots, \pi_5)$, generate independent Bernoulli random variables
$X_{\pi_1}, X_{\pi_2}, \ldots$ from Ber($p_{\pi_1}$), Ber($p_{\pi_2}$), $\ldots$, respectively, until either exactly
$m = 2$ unities or $n - m = 3$ zeros are generated. Note that the number of
samples is a random variable with range from $\min\{m, n - m\}$ to $n - 1$,
because after $n - 1$ draws one of the two counts must have been reached.
Suppose, for example, that the first four independent Bernoulli samples
from the above Ber($p_3$), Ber($p_5$), Ber($p_1$), Ber($p_2$) result in the following
outcome $(0, 0, 1, 0)$. Since we have already generated three zeros we can set
$X_4 = 1$ and deliver $\{V_1(X), V_2(X)\} = \{(1, 4), (2, 3, 5)\}$ as the desired
partition.

3. If in the previous step $m = 2$ unities are generated, set the remaining
three elements to zero; if on the other hand three zeros are generated, set
the remaining two elements to unities and deliver $X = (X_1, \ldots, X_5)$ as
the final partition vector. Construct the partition $\{V_1(X), V_2(X)\}$ of $V$.

With this example at hand, the random partition generation algorithm 
can be written as follows: 







Algorithm 4.6.1 (Random Partition Generation Algorithm)

1. Generate a random permutation $\Pi = (\Pi_1, \ldots, \Pi_n)$ of $(1, \ldots, n)$ uniformly
over the space of all such permutations.

2. Given $\Pi = (\pi_1, \ldots, \pi_n)$, independently generate Bernoulli random
variables $X_{\pi_1}, X_{\pi_2}, \ldots$ from Ber($p_{\pi_1}$), Ber($p_{\pi_2}$), $\ldots$, respectively, until $m$
unities or $n - m$ zeros are generated.

3. If in the previous step $m$ unities are generated, set the remaining elements
to zero; if on the other hand $n - m$ zeros are generated, set the remaining
elements to one. Deliver $X = (X_1, \ldots, X_n)$ as the final partition vector.

4. Construct the partition $\{V_1(X), V_2(X)\}$ of $V$ and calculate the
performance $S(X)$ according to (4.21).

In effect, what Algorithm 4.6.1 does is sample from the conditional
distribution of $X \sim$ Ber($p$), conditional upon the event that $X$ has exactly $m$
ones and $n - m$ zeros.

It follows, as in (4.16), that the updating formula for the reference vector $p$
remains exactly the same as in (4.24).
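A possible rendering of Algorithm 4.6.1, including the sorting-based permutation described earlier, is given below. The names `random_permutation` and `random_partition` are hypothetical, and nodes are indexed from 0.

```python
import random

def random_permutation(n, rng):
    """Uniform random permutation of 0..n-1: sort the indices by i.i.d. U(0,1) draws."""
    u = [rng.random() for _ in range(n)]
    return sorted(range(n), key=lambda i: u[i])

def random_partition(p, m, rng):
    """Partition vector with exactly m ones, generated as in Algorithm 4.6.1."""
    n = len(p)
    x = [0] * n
    ones = zeros = 0
    for i in random_permutation(n, rng):
        if ones == m:
            x[i] = 0                     # m unities reached: the rest are zeros
        elif zeros == n - m:
            x[i] = 1                     # n - m zeros reached: the rest are ones
        else:
            x[i] = 1 if rng.random() < p[i] else 0
        ones += x[i]
        zeros += 1 - x[i]
    return x

rng = random.Random(42)
x = random_partition([0.5] * 5, 2, rng)          # 5 nodes, m = 2
v1 = [i + 1 for i, xi in enumerate(x) if xi == 1]
v2 = [i + 1 for i, xi in enumerate(x) if xi == 0]
```

Whatever the random draws, the returned vector has exactly `m` ones, so `v1` has `m` elements and `v2` has `n - m`.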

Remark 4.9 (Alternative Stopping Rule). Similar to the max-cut case, we can
formulate an alternative stopping rule, which is based on convergence of the
sequence $\{\hat{p}_t\}$, rather than on convergence of $\{\hat{\gamma}_t\}$. Since in the partition
problem the number of nodes in the partition is fixed, the stopping criterion
is simpler. Indeed, consider the bipartition problem where the sizes of the
partitions are $m$ and $n - m$, and the node 1 belongs to the partition of size
$m$ ($X_1 = 1$). Then in analogy with (4.30) we can use the following stopping
criterion: Stop if for some $t = T > d$ and some $d$, say $d = 5$,

    A_t(N - m + 1, \ldots, N) = A_{t-1}(N - m + 1, \ldots, N)
        = \cdots = A_{t-d}(N - m + 1, \ldots, N) , \quad t > d .           (4.31)

In other words, we stop when the indices of the $m$ largest components of
the ordered vector $\hat{p}_t$ coincide $d$ times in a row.



4.7 The Travelling Salesman Problem 



The objective of the travelling salesman problem (TSP), see Section 2.5.2, is
to find the shortest tour in a graph containing $n$ nodes. We will assume that
the graph is complete, that is, each node is connected to each other node. The
TSP can be formulated as follows. Find:

    \min_{x} S(x) = \min_{x} \left\{ \sum_{i=1}^{n-1} c_{x_i, x_{i+1}} + c_{x_n, x_1} \right\} ,   (4.32)

where $x = (x_1, x_2, \ldots, x_n)$ denotes a permutation of $(1, \ldots, n)$, $x_i$,
$i = 1, 2, \ldots, n$, is the $i$-th city to be visited in the tour represented by $x$, and
$c_{ij}$ is the distance (travelling cost) from city $i$ to city $j$.







It will be convenient to assume that the graph is complete, that is, that
each node is connected to each other node. Note that this does not lead to
loss of generality since for a noncomplete graph we can always add additional
edges with infinite cost without affecting the minimal tour. This is illustrated
in Figure 4.5, with the corresponding cost matrix

    C = \begin{pmatrix}
          \infty & c_{12} & c_{13} & \infty & \infty & c_{16} \\
          c_{21} & \infty & c_{23} & c_{24} & \infty & \infty \\
          c_{31} & c_{32} & \infty & \infty & c_{35} & \infty \\
          \infty & c_{42} & \infty & \infty & c_{45} & c_{46} \\
          \infty & \infty & c_{53} & c_{54} & \infty & c_{56} \\
          c_{61} & \infty & \infty & c_{64} & c_{65} & \infty
        \end{pmatrix} .                                                     (4.33)

Fig. 4.5. A TSP with 6 cities. Dotted edges have $\infty$ cost.



Note that each permutation $x = (x_1, \ldots, x_n)$ corresponds to a unique tour
$x_1 \to x_2 \to \cdots \to x_n \to x_1$, so that there will be no confusion when we call $x$
a “tour.” For simplicity we always assume that $x_1 = 1$. For example, the TSP
represented by (4.33) has only the following 6 finite-cost tours/permutations:

    (1, 2, 3, 5, 4, 6),   (1, 2, 4, 6, 5, 3),   (1, 3, 2, 4, 5, 6),        (4.34)

    (1, 6, 4, 5, 3, 2),   (1, 3, 5, 6, 4, 2),   (1, 6, 5, 4, 2, 3),        (4.35)

out of a total of 5! = 120 possible tours.
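The count of finite-cost tours can be confirmed by enumeration. In the sketch below the finite entries of (4.33) are all set to 1, which is a placeholder value chosen only for illustration; the name `tour_length` is hypothetical and implements (4.32) with 1-based city labels.

```python
import math
from itertools import permutations

n = 6
INF = math.inf
# Links with finite cost in (4.33); the value 1.0 is a hypothetical placeholder.
links = [(1, 2), (1, 3), (1, 6), (2, 3), (2, 4), (3, 5), (4, 5), (4, 6), (5, 6)]
c = [[INF] * (n + 1) for _ in range(n + 1)]     # row and column 0 are unused
for i, j in links:
    c[i][j] = c[j][i] = 1.0

def tour_length(x, c):
    """S(x) in (4.32): consecutive links plus the closing link x_n -> x_1."""
    return sum(c[a][b] for a, b in zip(x, x[1:] + x[:1]))

all_tours = [(1,) + rest for rest in permutations(range(2, n + 1))]
finite_tours = [x for x in all_tours if tour_length(x, c) < INF]
print(len(all_tours), len(finite_tours))
```

The enumeration visits all 120 permutations with first city 1 and retains exactly the six tours listed in (4.34) and (4.35).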

Remark 4.10 (Forward and Backward Loop). For a TSP with a symmetric
cost matrix $C$ there are always at least two optimal tours. For example, if in
Figure 4.5 the cost matrix is symmetric, and if $(1, 2, 3, 5, 4, 6)$ is an optimal
tour, then $(1, 6, 4, 5, 3, 2)$ is also an optimal tour. We can call one tour the
forward tour and the other the backward tour. To distinguish between the
forward and the backward tour one can use the following simple rule: for a
forward tour $x_2 < x_n$ and for a backward tour $x_2 > x_n$.






Let $\mathcal{X}$ be the set of all possible tours and let $S(x)$ in (4.32) be the length
of a tour $x \in \mathcal{X}$. For example, the length of the tour $(1, 2, 3, 5, 4, 6)$ is

    S((1, 2, 3, 5, 4, 6)) = c_{12} + c_{23} + c_{35} + c_{54} + c_{46} + c_{61} .

Our goal is to minimize $S$ in (4.32) over the set $\mathcal{X}$ using the CE method.
To minimize $S$ we need to specify a mechanism for the generation of random
tours $X \in \mathcal{X}$, and, as usual, to adopt the basic cross-entropy Algorithm 4.2.1
to update the parameters of the distributions associated with this mechanism.

Taking into account that the travelling salesman problem is a
SEN-type optimization problem, we present now our first random trajectory
generation algorithm, which is based on the transitions through the nodes of
the network. More specifically, to generate a random tour $X = (X_1, \ldots, X_n)$
we use an $n \times n$ transition probability matrix $P$ and then proceed according
to Algorithm 2.5.1, which is repeated below as Algorithm 4.7.1 for ease of
reference. Recall that as usual we include in the trajectory generation algorithm
an extra step for the objective function calculation.

Algorithm 4.7.1 (Trajectory Generation Using Node Transitions)

1. Define $P^{(1)} = P$ and $X_1 = 1$. Let $k = 1$.

2. Obtain the matrix $P^{(k+1)}$ from $P^{(k)}$ by first setting the $X_k$-th column of
$P^{(k)}$ to 0 and then normalizing the rows to sum up to 1. Generate $X_{k+1}$
from the distribution formed by the $X_k$-th row of $P^{(k+1)}$.

3. If $k = n - 1$ then stop; otherwise set $k = k + 1$ and reiterate from Step 2.

4. Evaluate the length of the tour via (4.32).

A more detailed description of this algorithm is given in Algorithm 4.11.1 
in the appendix to this chapter. 
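A compact sketch of Steps 1 to 3 of Algorithm 4.7.1 is given here for orientation. It is not the detailed Algorithm 4.11.1 of the appendix: the name `generate_tour` is hypothetical, cities are labelled from 0, and the explicit row normalization is left to `random.Random.choices`, which accepts unnormalized weights.

```python
import random

def generate_tour(P, rng):
    """Generate a tour starting at city 0 by node transitions."""
    n = len(P)
    Pk = [row[:] for row in P]          # working copy, P^(1) = P
    tour = [0]
    for _ in range(n - 1):
        current = tour[-1]
        for row in Pk:                  # set the X_k-th column to 0
            row[current] = 0.0
        # Draw X_{k+1} from the X_k-th row; choices() rescales the weights,
        # which has the same effect as normalizing the row to sum to 1.
        nxt = rng.choices(range(n), weights=Pk[current])[0]
        tour.append(nxt)
    return tour

n = 6
P = [[0.0 if i == j else 1.0 / (n - 1) for j in range(n)] for i in range(n)]
tour = generate_tour(P, random.Random(7))
```

Because the columns of visited cities are zeroed before each draw, every city appears exactly once in the result, provided all off-diagonal entries of `P` are positive.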

Remark 4.11 (Starting City). Algorithm 4.7.1 can be easily modified by starting
it from different cities, instead of always from city 1. For example, we
could “cycle” through the starting cities for each consecutive tour generation. 
Alternatively, we could generate the initial city according to some random 
mechanism, for example, choose the starting city by drawing from a discrete 
uniform distribution on {1, . . . ,n}. A possible reason for not always wanting 
to start at 1 is that such a choice could introduce some bias toward the way 
in which the trajectories are generated. However, we have never observed this 
in practice. 

It is crucial to understand that Algorithm 4.7.1 above is an example of
the generating rule introduced in Section 4.4. Specifically, in Algorithm 4.7.1
we generate a stochastic process $(X_1, \ldots, X_n)$ according to the conditional
distribution of the original Markov chain, given that the path $(X_1, \ldots, X_n)$
constitutes a genuine tour, that is, each node except for 1 is visited only once.
Note that the stochastic process $(X_1, \ldots, X_n)$ is no longer a Markov process.
Note also that a Markov chain with transition matrix $P$ would in general not
yield tours because nodes could be repeated. The situation with $(X_1, \ldots, X_n)$
is somewhat akin to the classical urn problem in which a number $m$ of balls is
chosen from an urn with $n$ balls numbered $1, \ldots, n$. The original Markov chain
would correspond to sampling with replacement, whereas the “conditional”
process would correspond to sampling without replacement. For this reason we
call the second process a Markov chain with replacement (MCWR), although
formally it is not a Markov chain. As explained in Section 4.4.2 (and, in a
more informal way, in Section 2.5.2), using the MCWR instead of the original
Markov chain will not change the updating rules for the transition matrix
$P$, because we are sampling from the conditional distribution of the Markov
chain, given that the paths form a genuine tour.

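Therefore, in analogy to (4.18) and (4.19) we update the components of
$P$ for the deterministic and stochastic case as

    p_{t,ij} = \frac{E_{P_{t-1}} I_{\{S(X) \le \gamma_t\}} \, I_{\{X \in \mathcal{X}_{ij}\}}}
                    {E_{P_{t-1}} I_{\{S(X) \le \gamma_t\}}}                 (4.36)

and

    \hat{p}_{t,ij} = \frac{\sum_{k=1}^{N} I_{\{S(X_k) \le \hat{\gamma}_t\}} \, I_{\{X_k \in \mathcal{X}_{ij}\}}}
                          {\sum_{k=1}^{N} I_{\{S(X_k) \le \hat{\gamma}_t\}}} ,   (4.37)

respectively. Here $\mathcal{X}_{ij}$ denotes the set of tours for which the transition from
node $i$ to node $j$ is made.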

Let us return to Algorithm 4.7.1 and consider the corresponding optimal
degenerate transition matrix $P^*$ (see Section 4.4.3). Suppose $\gamma^*$ is the
length of the shortest tour, and there is only one such tour $x^*$ with length $\gamma^*$.
It is not difficult to see that for any $P_{t-1}$ the solution of the CE program (4.8)
is given by the optimal degenerate transition matrix (ODTM) $P^* = (p^*_{ij})$,
given by

    p^*_{ij} = \begin{cases} 1 & \text{if } x^* \in \mathcal{X}_{ij} , \\ 0 & \text{otherwise.} \end{cases}

As an example, consider the TSP in Figure 4.5. If, for instance, the trajectory

    1 \to 2 \to 3 \to 5 \to 4 \to 6 \to 1

corresponds to the shortest tour, then we have

    P^* = \begin{pmatrix}
            0 & 1 & 0 & 0 & 0 & 0 \\
            0 & 0 & 1 & 0 & 0 & 0 \\
            0 & 0 & 0 & 0 & 1 & 0 \\
            0 & 0 & 0 & 0 & 0 & 1 \\
            0 & 0 & 0 & 1 & 0 & 0 \\
            1 & 0 & 0 & 0 & 0 & 0
          \end{pmatrix} .                                                   (4.38)
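The stochastic update (4.37) and its degenerate limit (4.38) can be illustrated together. In the sketch below the function name `update_matrix` is hypothetical and tours use 1-based city labels; when the elite set consists of the single tour 1, 2, 3, 5, 4, 6, the update returns the matrix (4.38).

```python
def update_matrix(tours, lengths, gamma, n):
    """Update (4.37): fraction of elite tours (S <= gamma) containing the link i -> j."""
    elite = [x for x, s in zip(tours, lengths) if s <= gamma]
    P = [[0.0] * n for _ in range(n)]
    for x in elite:
        for a, b in zip(x, x[1:] + x[:1]):      # includes the closing link x_n -> x_1
            P[a - 1][b - 1] += 1.0 / len(elite)
    return P

# Degenerate case: one elite tour only (lengths and level are placeholders).
P_star = update_matrix([[1, 2, 3, 5, 4, 6]], [0.0], 0.0, 6)

# Two elite tours: rows become proper (non-degenerate) distributions.
P_mix = update_matrix([[1, 2, 3, 5, 4, 6], [1, 3, 2, 4, 5, 6]], [10.0, 12.0], 12.0, 6)
```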







Remark 4.12 (Alias Method). The $(k+1)$-st node in Algorithm 4.7.1 is
generated from a discrete $n$-point pdf (the $X_k$-th row of $P^{(k+1)}$). A straightforward
way to do this is to use the inverse-transform technique; see Algorithm 1.7.2.
However, this is quite time consuming, especially when $n$ is large. Instead of
calculating at each step the entries of $P^{(k+1)}$ from those of $P$, one can use
as an alternative the acceptance-rejection method; see Section 1.7.4. Here,
one selects the states in a tour such that for a given “previous” state $i$ the
“next” state $j$ is generated according to the $p_{ij}$ from the original $P$. And
then, depending on whether state $j$ was previously visited or not, one either
rejects or accepts the associated transition $i \to j$. The efficiency of such an
acceptance-rejection method typically decreases dramatically towards the end
of the tour. Indeed, let $n = 100$ and assume that in a specific tour we have already
visited 90 different cities. In such a case, before moving from city 90 to city 91
the acceptance-rejection method will reject on average 10 trials before making
a successful one.
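The acceptance-rejection variant described in this remark might look as follows. This sketch covers only the accept/reject logic and does not implement the alias method; the name `generate_tour_ar` is hypothetical, cities are labelled from 0, and all off-diagonal entries of `P` are assumed positive so that the loop terminates with probability 1.

```python
import random

def generate_tour_ar(P, rng):
    """Tour generation by acceptance-rejection.
    Returns the tour and the number of rejected proposals."""
    n = len(P)
    tour, visited, rejected = [0], {0}, 0
    while len(tour) < n:
        i = tour[-1]
        j = rng.choices(range(n), weights=P[i])[0]   # propose from the original row of P
        if j in visited:
            rejected += 1                            # city already visited: reject, redraw
        else:
            tour.append(j)
            visited.add(j)
    return tour, rejected

n = 20
P = [[0.0 if i == j else 1.0 / (n - 1) for j in range(n)] for i in range(n)]
tour, rejected = generate_tour_ar(P, random.Random(3))
```

The counter `rejected` makes the loss of efficiency towards the end of the tour visible: most rejections occur when only a few cities remain unvisited.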

The speed of trajectory generation via the acceptance-rejection method can be
substantially increased by combining it with the alias method (see Section 1.7.2).
Generation of random variables from an arbitrary discrete distribution by the
alias method is much faster than by the inverse-transform method, especially
for large $n$. For more details on the implementation of the alias method for
SEN problems we refer to [113] and the appendix of this chapter. It was found
empirically in [113] that the following combination is the fastest: For the first
85% of visited cities use a combination of the alias method with
acceptance-rejection, and for the remaining 15% of cities in the tour use the
inverse-transform method.



We present now an alternative algorithm for trajectory generation in TSP,
where the alias method can be used as well. This algorithm is due to Margolin
[113] and is called the placement algorithm. Recall that Algorithm 4.7.1
generates transitions from node to node, based on the transition matrix $P = (p_{ij})$.
In contrast, in Algorithm 4.7.2 below a similar matrix

    P = \begin{pmatrix}
          p_{(1,1)} & p_{(1,2)} & \cdots & p_{(1,n)} \\
          p_{(2,1)} & p_{(2,2)} & \cdots & p_{(2,n)} \\
          \vdots    & \vdots    &        & \vdots    \\
          p_{(n,1)} & p_{(n,2)} & \cdots & p_{(n,n)}
        \end{pmatrix}                                                       (4.39)

generates node placements. Specifically, $p_{(i,j)}$ corresponds to the probability of
node $i$ being visited at the $j$-th place in a tour of $n$ cities. In other words, $p_{(i,j)}$
can be viewed as the probability that city (node) $i$ is “arranged” to be visited at
the $j$-th place in a tour of $n$ cities.







More formally, a node placement vector is a vector $y = (y_1, \ldots, y_n)$ such
that $y_i$ denotes the “place” of node $i$ in the tour $x = (x_1, \ldots, x_n)$. The precise
meaning is given by the correspondence

    y_i = j \iff x_j = i ,                                                  (4.40)

for all $i, j \in \{1, \ldots, n\}$. For example, the node placement vector
$y = (3, 4, 2, 6, 5, 1)$ in a 6-node network, such as in Figure 4.5, defines uniquely
the tour $x = (6, 3, 1, 2, 5, 4)$. The performance of each node placement $y$ can
be defined as $S(y) = S(x)$, where $x$ is the unique path corresponding to $y$.

Algorithm 4.7.2 (Trajectory Generation Using Node Placements)

1. Define $P^{(1)} = P$. Let $k = 1$.

2. Generate $Y_k$ from the distribution formed by the $k$-th row of $P^{(k)}$. Obtain
the matrix $P^{(k+1)}$ from $P^{(k)}$ by first setting the $Y_k$-th column of $P^{(k)}$ to 0
and then normalizing the rows to sum up to 1.

3. If $k = n$ then stop; otherwise set $k = k + 1$ and reiterate from Step 2.

4. Determine the tour by (4.40) and evaluate the length of the tour $S$ via
(4.32).

A more detailed description of this algorithm is given in Algorithm 4.11.2 
in the appendix of this chapter. 
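The correspondence (4.40) and Steps 1 to 3 of Algorithm 4.7.2 can be sketched as below. This is an illustration under assumptions rather than the appendix algorithm: the names `placement_to_tour` and `generate_placement` are hypothetical, labels are 1-based as in the text, and the zeroed columns are tracked with a Boolean list instead of modifying the matrix.

```python
import random

def placement_to_tour(y):
    """Apply (4.40): y_i = j  <=>  x_j = i, with 1-based labels."""
    x = [0] * len(y)
    for i, j in enumerate(y, start=1):
        x[j - 1] = i
    return tuple(x)

def generate_placement(P, rng):
    """Draw the place of node k from row k of P, restricted to places still free."""
    n = len(P)
    free = [True] * n                     # free[j] is False once column j is "zeroed"
    y = []
    for k in range(n):
        weights = [P[k][j] if free[j] else 0.0 for j in range(n)]
        place = rng.choices(range(n), weights=weights)[0]
        free[place] = False
        y.append(place + 1)
    return tuple(y)

x_example = placement_to_tour((3, 4, 2, 6, 5, 1))      # example from the text

n = 6
P = [[1.0 / n] * n for _ in range(n)]                  # uniform placement matrix
y = generate_placement(P, random.Random(5))
x = placement_to_tour(y)
```

Since (4.40) describes a permutation and its inverse, applying `placement_to_tour` twice returns the original vector.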

Note that in contrast to Algorithm 4.7.1, where the random tours $X$ are
generated via the MCWR, we now generate the node placements $Y$ using
a conditional distribution formed by the corresponding row of the matrix $P$
(see Step 2 of Algorithm 4.7.2). It follows from Section 4.3.1 that the updating
formula for $\hat{p}_{t,(i,j)}$ is given by

    \hat{p}_{t,(i,j)} = \frac{\sum_{k=1}^{N} I_{\{S(Y_k) \le \hat{\gamma}_t\}} \, I_{\{Y_{ki} = j\}}}
                             {\sum_{k=1}^{N} I_{\{S(Y_k) \le \hat{\gamma}_t\}}} .   (4.41)

Our simulation results with the TSP and with other SEN models do not
indicate that either of the two (Algorithm 4.7.1 or Algorithm 4.7.2) is superior
in terms of the efficiency (speed and accuracy) of the main CE Algorithm
4.2.1.

For easy reference we present now the main CE algorithm for TSP. 







Algorithm 4.7.3 (Main CE Algorithm for TSP)

1. Choose an initial reference transition matrix $\hat{P}_0$, say with all off-diagonal
elements equal to $1/(n - 1)$. Set $t = 1$.

2. Generate a sample $X_1, \ldots, X_N$ of tours via Algorithm 4.7.1, with
$P = \hat{P}_{t-1}$, and compute the sample $(1 - \varrho)$-quantile $\hat{\gamma}_t$ of the performances
according to (4.6).

3. Use the same sample $X_1, \ldots, X_N$ to update $\hat{P}_t$ via (4.37).

4. Apply (4.9) to smooth out the matrix $\hat{P}_t$.

5. If for some $t > d$, say $d = 5$,

    \hat{\gamma}_t = \hat{\gamma}_{t-1} = \cdots = \hat{\gamma}_{t-d} ,    (4.42)

then stop (let $T$ denote the final iteration); otherwise set $t = t + 1$ and
reiterate from Step 2.

Recall that the goal of Algorithm 4.7.3 is to generate a sequence $\{(\hat{\gamma}_t, \hat{P}_t)\}$
which converges to a fixed point $(\gamma^*, P^*)$, where $\gamma^*$ is the optimal value of
the objective function in the network and $P^*$ is the ODTM, which uniquely
defines the optimal tour $x^*$. Also, it is essential that at each step of Algorithm
4.7.3 all components of $\hat{P}_t$ corresponding to the nodes that we have not yet
visited are nonzero. To ensure this, we apply the smoothed version (4.9) to
$\hat{p}_{t,ij}$.
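Putting the pieces together, Algorithm 4.7.3 might be sketched as below. Several choices here are assumptions rather than part of the text: the function names are hypothetical, cities are labelled from 0, the level is taken as the sample $\varrho$-quantile of the tour lengths (the minimization counterpart of the quantile in Step 2), the smoothing step is assumed to be the convex combination with weight `alpha`, and the test instance with cities on a line is made up for illustration.

```python
import math
import random

def tour_length(x, c):
    """S(x) in (4.32) with 0-based city labels."""
    return sum(c[a][b] for a, b in zip(x, x[1:] + x[:1]))

def generate_tour(P, rng):
    """Node-transition sampling in the spirit of Algorithm 4.7.1."""
    n = len(P)
    tour, visited = [0], {0}
    while len(tour) < n:
        w = [0.0 if j in visited else P[tour[-1]][j] for j in range(n)]
        j = rng.choices(range(n), weights=w)[0]
        tour.append(j)
        visited.add(j)
    return tour

def ce_tsp(c, rho=0.01, alpha=0.7, d=5, max_iter=300, seed=0):
    rng = random.Random(seed)
    n = len(c)
    N = 5 * n * n                                            # sample size N = 5 n^2
    P = [[0.0 if i == j else 1.0 / (n - 1) for j in range(n)]
         for i in range(n)]                                  # Step 1
    gammas, best = [], None
    for _ in range(max_iter):
        tours = [generate_tour(P, rng) for _ in range(N)]    # Step 2
        lengths = [tour_length(x, c) for x in tours]
        gamma = sorted(lengths)[max(math.ceil(rho * N), 1) - 1]
        elite = [x for x, s in zip(tours, lengths) if s <= gamma]
        W = [[0.0] * n for _ in range(n)]                    # Step 3, update (4.37)
        for x in elite:
            for a, b in zip(x, x[1:] + x[:1]):
                W[a][b] += 1.0 / len(elite)
        P = [[alpha * W[i][j] + (1 - alpha) * P[i][j] for j in range(n)]
             for i in range(n)]                              # Step 4, smoothing
        k = min(range(N), key=lengths.__getitem__)
        if best is None or lengths[k] < best[0]:
            best = (lengths[k], tours[k])
        gammas.append(gamma)
        if len(gammas) > d and len(set(gammas[-(d + 1):])) == 1:   # Step 5, (4.42)
            break
    return best, P

# Hypothetical instance: 5 cities on a line, c_ij = |i - j|.
c = [[abs(i - j) for j in range(5)] for i in range(5)]
(best_len, best_tour), P_final = ce_tsp(c)
```

Because of the smoothing, every off-diagonal entry of the matrix stays strictly positive, so the sampler can always reach the cities not yet visited.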

Remark 4.13 (Parameter Setting). Similarly to Remark 4.4 we typically set
for the TSP and for general SEN-type problems $\varrho = 10^{-2}$ and $\alpha = 0.7$. We
take for SEN the sample size $N = \text{const} \cdot n^2$, like $N = 5 n^2$, rather than
$N = \text{const} \cdot n$ (for SNN), since the number of parameters to estimate in the
probability matrix $P$ is $n^2$ rather than $n$ as in the max-cut problem, say.

A more ingenious approach to selecting the parameters, which appears to
work well, is to run the CE algorithm on a “miniature” of the original TSP,
selecting at random a small percentage, say 10-20%, of the original nodes,
while preserving the cost structure. For this miniature TSP good choices for
$\alpha$, $C$, and $\varrho$ can be obtained quickly (e.g., by trial and error). Now, use these
parameters for the larger TSP.



Remark 4.14 (Alternative stopping criterion). An alternative stopping
criterion is based on the convergence of the sequence of matrices $\hat{P}_0, \hat{P}_1, \hat{P}_2, \ldots$
Specifically, the algorithm will terminate at iteration $T$ if for $t = T$ and some
integer $d$, for example, $d = 5$,

    \xi_t(i) = \xi_{t-1}(i) = \cdots = \xi_{t-d}(i) , \quad \text{for all } i = 1, \ldots, n ,   (4.43)

where $\xi_t(i)$ denotes the index of the maximal element of the $i$-th row of $\hat{P}_t$.







Remark 4.15. We show now how to calculate the minimal number of
trajectories (sample size) $N_*(n)$ required in order to pass with high probability $\zeta$
at least once through all transitions $i \to j$. More specifically, the sample size
$N_*(n)$ will guarantee that all transition probabilities in (4.37) will be positive
with high probability, provided all off-diagonal elements of $\hat{P}_0$ are $1/(n - 1)$.

The quantity $N_*(n)$ can be calculated analytically by using the theory
of urn models. More precisely, $N_*(n)$ can be calculated using the
inclusion-exclusion principle for occupancy problems ([90], page 108). This is the same
as finding the probability that all urns are occupied (no single transition
probability in (4.37) will be zero), while distributing $N$ balls into $n$ cells
(generating $N$ trajectories through the network containing $n$ nodes).

Instead of performing tedious calculations with such urn models (see [90])
we present below some simulation studies according to which we found that
for $50 \le n \le 1000$, the quantity $N_*(n)$ can be approximated as $N_*(n) = C_n n$,
where $C_n$ is roughly proportional to $\ln n$ (in Table 4.2 the ratio $C_n / \ln n$ stays
close to 2). Table 4.2 presents the results of such a simulation study
with fixed $\zeta = 0.999$.

Table 4.2. $N_*(n)$ as a function of $n$ for $\zeta = 0.999$.

    n       N_*(n)    C_n    ln n
    50      400       8      3.9
    100     900       9      4.6
    250     2,500     10     5.5
    500     6,000     12     6.2
    750     9,750     13     6.6
    1000    14,000    14     6.9



Note that the results of Table 4.2 are valid only under the assumption that
the off-diagonal elements $\hat{p}_{0,ij}$ of $\hat{P}_0$ are equal. This typically holds only for the
first iteration of Algorithm 4.7.3. It is still interesting to note that for a TSP
with $(n - 1)!$ trajectories, one needs only $N_*(n) = O(n \ln n)$ trajectories to
pass with high probability $\zeta$ at least once through all possible transitions. Note
finally that in all our simulation results with TSP we took at each iteration
of Algorithm 4.7.3 a sample of $N = O(n^2)$ trajectories, which is larger than
$N_*(n) = O(n \ln n)$.



4.8 The Quadratic Assignment Problem 

The quadratic assignment problem (QAP) has remained one of the challenges
in combinatorial optimization. From a computational complexity point of view
the QAP is one of the most difficult problems to solve, and it is still considered
a computationally nontrivial task to solve modest size problems, say with
$n = 20$.

The applications of the QAP include optimal allocation, scheduling, man- 
ufacturing, parallel and distributed computing and statistical data analysis. 
We consider the QAP in the context of location theory , where the objective 
is to find an assignment of a set of facilities to a set of locations such that the 
total cost of the assignment is minimized. 

Mathematically, the QAP can be formulated as follows: given a set
$V = \{1, 2, \ldots, n\}$ and three $n \times n$ input matrices $F = (f_{ij})$, $D = (d_{kl})$ and
$E = (e_{il})$, minimize the function $S$ given by

    S(x) = \sum_{i=1}^{n} \sum_{j=1}^{n} f_{ij} \, d_{x_i x_j} + \sum_{i=1}^{n} e_{i x_i}   (4.44)

over all permutations $x = (x_1, \ldots, x_n)$ of $V$. The matrix $F = (f_{ij})$ is called
the flow matrix, that is, $f_{ij}$ is the flow of materials from facility $i$ to facility
$j$; $D = (d_{kl})$ is the distance matrix, that is, $d_{kl}$ represents the distance from
location $k$ to location $l$; and $E = (e_{il})$ is the direct cost matrix, that is, $e_{il}$
represents the cost of placing facility $i$ at location $l$. The matrix $E$ is often
set to zero and it does not appear in most case studies. However, it can be
easily added to the objective function and incorporated in the CE algorithm
described below.

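Consider, for example, a three-dimensional QAP with the following
matrices $D$ and $F$:

    D = \begin{pmatrix} 0 & 7 & 3 \\ 7 & 0 & 2 \\ 3 & 2 & 0 \end{pmatrix} , \qquad
    F = \begin{pmatrix} 0 & 3 & 9 \\ 5 & 0 & 6 \\ 4 & 3 & 0 \end{pmatrix} .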

In this case, the total number of possible allocations is n! = 3! = 6, and 
the corresponding allocations are (1,2,3); (1,3,2); (2,1,3); (2,3,1); (3,1,2); 
(3,2,1). 

For example, the first allocation implies that the first facility is allocated to 
place 1, the second one is allocated to the second place, and the third facility 
is allocated to the third place, and similarly for the remaining 5 allocations. 
The assignment costs (the objective function) of all 6 allocations are 



    S((1, 2, 3)) = d_{12} f_{12} + d_{13} f_{13} + d_{21} f_{21} + d_{23} f_{23} + d_{31} f_{31} + d_{32} f_{32} = 113

    S((1, 3, 2)) = d_{12} f_{13} + d_{13} f_{12} + d_{21} f_{31} + d_{23} f_{32} + d_{31} f_{21} + d_{32} f_{23} = 133

    S((2, 1, 3)) = d_{12} f_{21} + d_{13} f_{23} + d_{21} f_{12} + d_{23} f_{13} + d_{31} f_{32} + d_{32} f_{31} = 109

    S((2, 3, 1)) = d_{12} f_{23} + d_{13} f_{21} + d_{21} f_{32} + d_{23} f_{31} + d_{31} f_{12} + d_{32} f_{13} = 113

    S((3, 1, 2)) = d_{12} f_{31} + d_{13} f_{32} + d_{21} f_{13} + d_{23} f_{12} + d_{31} f_{23} + d_{32} f_{21} = 134

    S((3, 2, 1)) = d_{12} f_{32} + d_{13} f_{31} + d_{21} f_{23} + d_{23} f_{21} + d_{31} f_{13} + d_{32} f_{12} = 118
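The six assignment costs can be recomputed directly from the matrices $D$ and $F$ of the example. The sketch follows the expansions as printed, in which the term $d_{kl}$ is paired with $f_{x_k x_l}$; the function name `qap_cost` is hypothetical and the matrix $E$ is taken to be zero.

```python
from itertools import permutations

D = [[0, 7, 3],
     [7, 0, 2],
     [3, 2, 0]]
F = [[0, 3, 9],
     [5, 0, 6],
     [4, 3, 0]]

def qap_cost(x, D, F):
    """Sum over k != l of d_kl * f_{x_k x_l}, with 1-based labels in x."""
    n = len(x)
    return sum(D[k][l] * F[x[k] - 1][x[l] - 1]
               for k in range(n) for l in range(n) if k != l)

costs = {x: qap_cost(x, D, F) for x in permutations((1, 2, 3))}
for x, s in sorted(costs.items()):
    print(x, s)
best = min(costs, key=costs.get)
```

For this small instance the minimum over the six allocations is attained at the allocation (2, 1, 3).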







It is well known that, with an appropriate choice of the coefficients of the 
matrices F and D, the TSP, the packing problem, and the maximum clique 
problem can be viewed as special cases of the QAP. 

Note that both the trajectory generation algorithms and the main Algorithm
4.7.3 for TSP remain the same for QAP, provided the components $c_{ij}$
of the cost matrix $C = (c_{ij})$ in TSP are replaced with

$$\sum_{k=1}^{n} \sum_{l=1}^{n} f_{ij}\, d_{kl}.$$

Remark 4.16. The term “quadratic” in the QAP comes from the reformulation
of the problem as an optimization problem with a quadratic objective
function. The reasoning is as follows. First, observe that there is a one-to-one
correspondence between the set of all permutations $\mathcal{X}$ and the set of $n \times n$
permutation matrices $(y_{ik})$, which must satisfy:

$$\sum_{i=1}^{n} y_{ik} = 1, \quad k = 1, \ldots, n, \qquad (4.45)$$

$$\sum_{k=1}^{n} y_{ik} = 1, \quad i = 1, \ldots, n, \qquad (4.46)$$

and

$$y_{ik} \in \{0, 1\}; \quad i = 1, \ldots, n; \quad k = 1, \ldots, n. \qquad (4.47)$$

We can interpret $y_{ik}$ as

$$y_{ik} = \begin{cases} 1 & \text{if facility } i \text{ is assigned to location } k, \\ 0 & \text{otherwise.} \end{cases} \qquad (4.48)$$

Note that the cost of simultaneously assigning facility $i$ to location $k$ and
facility $j$ to location $l$ is $f_{ij} d_{kl}$. As a result, taking (4.45)-(4.48) into account,
we can rewrite the right-hand side of (4.44) in the following equivalent form

$$\sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \sum_{l=1}^{n} f_{ij}\, d_{kl}\, y_{ik}\, y_{jl} + \sum_{i=1}^{n} \sum_{k=1}^{n} e_{ik}\, y_{ik}. \qquad (4.49)$$
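The equivalence between the permutation form and the quadratic form can be checked numerically on a small case. The following sketch is only an illustration: the function names and the direct cost matrix `E` are hypothetical, and indices are 0-based.

```python
from itertools import permutations


def permutation_matrix(x):
    """y[i][k] = 1 if facility i is assigned to location k = x[i], else 0."""
    n = len(x)
    return [[1 if x[i] == k else 0 for k in range(n)] for i in range(n)]


def cost_permutation_form(x, F, D, E):
    """Objective written in terms of the permutation x."""
    n = len(x)
    return (sum(F[i][j] * D[x[i]][x[j]] for i in range(n) for j in range(n))
            + sum(E[i][x[i]] for i in range(n)))


def cost_quadratic_form(Y, F, D, E):
    """Same objective written as a quadratic function of the 0-1 matrix Y."""
    n = len(Y)
    quad = sum(F[i][j] * D[k][l] * Y[i][k] * Y[j][l]
               for i in range(n) for j in range(n)
               for k in range(n) for l in range(n))
    lin = sum(E[i][k] * Y[i][k] for i in range(n) for k in range(n))
    return quad + lin


D = [[0, 7, 3], [7, 0, 2], [3, 2, 0]]
F = [[0, 3, 9], [5, 0, 6], [4, 3, 0]]
E = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # hypothetical direct costs

all_equal = True
for x in permutations(range(3)):
    Y = permutation_matrix(x)
    rows_ok = all(sum(row) == 1 for row in Y)                      # one location per facility
    cols_ok = all(sum(Y[i][k] for i in range(3)) == 1 for k in range(3))  # one facility per location
    same = cost_permutation_form(x, F, D, E) == cost_quadratic_form(Y, F, D, E)
    all_equal = all_equal and rows_ok and cols_ok and same
print(all_equal)
```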



4.9 Numerical Results for SNNs 

Below we present numerical results for Algorithm 4.5.2 for the max-cut and
partition problems. The emphasis will be on the numerical convergence of
Algorithm 4.5.2, that is, on the convergence of the estimator $\hat{\gamma}_t$ (see (4.25)) to the
true unknown optimal value $\gamma^*$.

For the bipartition and the max-cut problem (with number of partitions
$r = 2$) we always start with the initial vector $p_0 = (1, 1/2, \ldots, 1/2)$. If not stated
otherwise we set the parameters as follows (see also Remark 4.4): $\rho = 10^{-2}$,
$\alpha = 0.7$, $N = 5 \cdot n \cdot r$, and use the stopping rule (4.25) with the parameter
$d = 5$.
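To make the role of these parameters concrete, here is a minimal sketch of a CE iteration for max-cut. It is not a transcription of Algorithm 4.5.2. It assumes that the elite set is simply the best fraction of the sample, that the update is the smoothed elite frequency with smoothing parameter `alpha`, and that the stopping rule halts once the elite threshold has stayed unchanged for `d` successive iterations. The function names and the small test instance are hypothetical.

```python
import random


def cut_value(x, C):
    """Weight of the cut separating {i : x[i] == 1} from {i : x[i] == 0}."""
    n = len(x)
    return sum(C[i][j] for i in range(n) for j in range(i + 1, n) if x[i] != x[j])


def ce_maxcut(C, N, rho=0.01, alpha=0.7, d=5, max_iter=200, seed=0):
    rng = random.Random(seed)
    n = len(C)
    p = [1.0] + [0.5] * (n - 1)                 # p_0 = (1, 1/2, ..., 1/2)
    n_elite = max(1, round(rho * N))
    gammas = []                                 # worst elite performance per iteration
    best_val, best_x = float("-inf"), None
    for _t in range(max_iter):
        # Draw N independent Bernoulli(p) cut vectors; x[0] is always 1.
        samples = [[1 if rng.random() < p[i] else 0 for i in range(n)]
                   for _k in range(N)]
        scored = sorted(((cut_value(x, C), x) for x in samples),
                        key=lambda vx: vx[0], reverse=True)
        elite = scored[:n_elite]
        gammas.append(elite[-1][0])
        if elite[0][0] > best_val:
            best_val, best_x = elite[0]
        # Smoothed update: mix elite frequencies with the previous vector.
        freq = [sum(x[i] for _v, x in elite) / n_elite for i in range(n)]
        p = [alpha * f + (1.0 - alpha) * q for f, q in zip(freq, p)]
        # Assumed stopping rule: threshold unchanged over d successive steps.
        if len(gammas) > d and len(set(gammas[-(d + 1):])) == 1:
            break
    return best_x, best_val, p, gammas


# Hypothetical small instance: two blocks of 6 nodes, cross weights 10,
# within-block weights drawn from {0, ..., 4}.
n, m, c = 12, 6, 10
inst_rng = random.Random(1)
C = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        C[i][j] = C[j][i] = c if (i < m) != (j < m) else inst_rng.randint(0, 4)

best_x, best_val, p, gammas = ce_maxcut(C, N=5 * n * 2, rho=0.1)
print(best_val, len(gammas))
```

A larger `rho` than in the text is used here only because the sample is small; with `rho = 0.01` and `N = 120` the elite set would contain a single sample.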

Table 4.3 lists the main notation used in the tables for SNN problems. 



Table 4.3. Notation used in tables for SNN-type experiments. 



$\hat{\gamma}_t$    the worst performance of the elite samples obtained at iteration $t$,
$S_{t,(N)}$         the best elite performance at iteration $t$,
$p_t^{\min}$        $\min\{p_{t,i} : p_{t,i} > 0.5,\ i = 1, \ldots, n\}$,
$p_t^{\max}$        $\max\{p_{t,i} : p_{t,i} < 0.5,\ i = 1, \ldots, n\}$,
$\varepsilon$       the relative experimental error of the final solution, that is,



e = 




(4.50) 



4.9.1 Synthetic Max-Cut Problem 

We return to the synthetic max-cut problem of Section 2.5.1. Thus, our symmetric
cost matrix is of the form

$$C = \begin{pmatrix} Z_{11} & C_{12} \\ C_{21} & Z_{22} \end{pmatrix}, \qquad (4.51)$$

where $Z_{11}$ and $Z_{22}$ are square matrices of dimensions $m$ and $n - m$, respectively,
whose elements are strictly less than $b$, and $C_{12}$ and $C_{21}$ are matrices
with identical elements $c > b(n - m)/m$. The optimal cut $V^*$ is given by
$V^* = \{\{1, \ldots, m\}, \{m + 1, \ldots, n\}\}$, and the optimal performance is

$$\gamma^* = c\, m\, (n - m).$$

We generate the elements of the matrices $Z_{ii}$ according to some fixed
probability distribution with support in $[0, b)$, for example a uniform or Beta
distribution. We write symbolically $Z \sim U(a, b)$ if the $Z_{ii}$ are generated via
the $U(a, b)$ distribution. A similar notation is used for other distributions. Note
that only the elements above the diagonal of each $Z_{ii}$ need to be generated,
since the cost matrix is symmetric, with 0 diagonal elements.
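The construction can be tried out on a toy instance where all cuts can be enumerated. The sketch below is an assumption-laden illustration: the function names are hypothetical, the within-block weights are drawn from U(1, 5) (so the bound is taken as 5), and the cross-block weight 8 is chosen to exceed 5(n - m)/m = 7.5 for n = 10, m = 4.

```python
import random
from itertools import product


def synthetic_maxcut_matrix(n, m, c, a, b, seed=0):
    """Symmetric matrix with U(a, b) weights inside the two blocks and weight c across."""
    rng = random.Random(seed)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):        # only the upper triangle is generated
            same_block = (i < m) == (j < m)
            C[i][j] = C[j][i] = rng.uniform(a, b) if same_block else c
    return C


def cut_value(x, C):
    n = len(x)
    return sum(C[i][j] for i in range(n) for j in range(i + 1, n) if x[i] != x[j])


def brute_force_maxcut(C):
    """Enumerate all cuts with node 0 fixed on side 1 (2^(n-1) candidates)."""
    n = len(C)
    best_x, best_val = None, float("-inf")
    for tail in product((0, 1), repeat=n - 1):
        x = (1,) + tail
        val = cut_value(x, C)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val


n, m, c = 10, 4, 8
C = synthetic_maxcut_matrix(n, m, c, a=1.0, b=5.0)
best_x, best_val = brute_force_maxcut(C)
print(best_x, best_val, c * m * (n - m))
```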

Table 4.4 presents the relative experimental errors $\varepsilon_i$, $i = 1, \ldots, 6$, and
the final iteration numbers $T_i$, $i = 1, \ldots, 6$, as a function of the sample size
$N$ for the max-cut problem with $n = 200$, $m = 100$, $c = 5$ and for the
following 6 cases: $Z_1 = 4.0$, $Z_2 = 4.9$, $Z_3 \sim U(1, 5)$, $Z_4 \sim U(4.5, 5.0)$,
$Z_5 \sim \mathrm{Beta}(100, 1, 1, 5)$, $Z_6 \sim \mathrm{Beta}(100, 1, 4.5, 5)$. Here $\mathrm{Beta}(\alpha, \beta, a, b)$ denotes the
(generalized) Beta distribution on the interval $(a, b)$, with pdf







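$$f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \frac{(x - a)^{\alpha - 1} (b - x)^{\beta - 1}}{(b - a)^{\alpha + \beta - 1}}, \qquad a < x < b.$$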

We found that for $N \ge 10n$ the relative experimental error of the final
solution, denoted by $\varepsilon$ in (4.50), equals zero; that is, Algorithm 4.5.2 is exact
(always finds the optimal solution).



Table 4.4. The relative experimental errors $\varepsilon_i$, $i = 1, \ldots, 6$ (and the associated
stopping times $T_i$, $i = 1, \ldots, 6$) as functions of the sample size $N$ for the max-cut
problems with $n = 200$, $m = 100$ and $c = 5$.

   N    $\varepsilon_1(T_1)$  $\varepsilon_2(T_2)$  $\varepsilon_3(T_3)$  $\varepsilon_4(T_4)$  $\varepsilon_5(T_5)$  $\varepsilon_6(T_6)$
  200   0.078(11)   0.009(12)   0.179(11)   0.023(11)   0.004(11)    4.6e-4(12)
  400   0.070(12)   0.005(14)   0.091(14)   0.014(14)   0.003(14)    4.2e-4(16)
  800   0.013(15)   0.003(15)   0.072(17)   0.005(16)   0.002(17)    2.0e-4(19)
 1200   0.004(15)   0.001(16)   0.027(16)   0.002(16)   5.0e-4(18)   1.5e-4(20)
 1600   0.002(17)   0.000(16)   0.000(15)   0.001(16)   3.0e-4(19)   6.0e-5(20)
 2000   0.000(16)   0.000(17)   0.000(15)   0.000(16)   0.000(18)    0.000(20)



Table 4.5 presents data similar to Table 4.4 for the associated CPU times 
in seconds on a Sun Enterprise 4000 workstation (12 CPU, 248 MHz). 



Table 4.5. The CPU times for the same cases as in Table 4.4. 



   N    CPU_1   CPU_2   CPU_3   CPU_4   CPU_5   CPU_6
  200     3.6     3.1     2.9     3.4     2.9     3.4
  600     7.4     8.0     6.9     8.1     7.2     9.1
  800    15.3    14.5    16.9    15.3    16.6    19.8
 1200    25.4    23.0    26.4    23.1    26.1    32.5
 1600    34.1    30.0    27.8    29.7    39.7    42.2
 2000    35.2    37.5    35.1    37.7    48.5    51.3



Similar results were obtained with Algorithm 4.5.2 for partition problems,
that is, where instead of the random cut generation Algorithm 4.5.2 we use
the random bipartition generation Algorithm 4.6.1.

Table 4.6 presents the evolution of Algorithm 4.5.2 for problem (4.51)
with $n = 200$, $m = 100$, $c = 1$ and $Z = 0$. The relative experimental error is
$\varepsilon = 0$. It took 30 seconds of CPU time to find the optimal solution. Table 4.7
presents another evolution of Algorithm 4.5.2 for the max-cut problem as
a function of $t$ for the following parameters: $n = 600$ with $m = n/10 = 60$,
$c = 45$, $Z \sim U(4.5, 5.0)$, $N = 10n$ and $\rho = 0.01$.







Table 4.6. Evolution of Algorithm 4.5.2 for the artificial max-cut problem (4.51)
with $n = 200$, $m = 100$, $c = 1$, $Z = 0$, $N = 1000$, $\alpha = 0.7$, $\rho = 0.05$, $d = 5$.

  t   $\hat{\gamma}_t$   $S_{t,(N)}$   $p_t^{\min}$   $p_t^{\max}$
  1     5084     5256    0.5140    0.4860
  2     5072     5280    0.5014    0.4986
  3     5108     5228    0.5004    0.4957
  4     5252     5476    0.5073    0.4948
  5     5512     6012    0.5009    0.4956
  6     5874     6250    0.5031    0.4860
  7     6276     6682    0.5040    0.4943
  8     6736     7310    0.5250    0.4742
  9     7232     8010    0.5215    0.4596
 10     7730     8330    0.5579    0.4365
 11     8276     8772    0.6228    0.3689
 12     8696     9232    0.7270    0.3347
 13     9140     9512    0.7641    0.2964
 14     9512     9900    0.8172    0.2009
 15     9800    10000    0.8472    0.0883
 16    10000    10000    0.9542    0.0265



Table 4.7. Evolution of Algorithm 4.5.2 for the bipartition problem with $n = 600$,
$m = 60$, $c = 45$, $Z \sim U(4.5, 5.0)$. CPU time 1729 seconds, $\varepsilon = 0$.

  t   $\hat{\gamma}_t$   $S_{t,(N)}$   $p_t^{\min}$   $p_t^{\max}$
  1   4842820    551289    0.500    0.351
  2   6032004    692976    0.500    0.485
  3   7114520    805967    0.500    0.498
  4   8253897    924785    0.514    0.487
  5   9247863   1007224    0.501    0.466
  6   9863917   1070735    0.500    0.467
  7   1049434   1113882    0.584    0.434
  8   1113857   1179822    0.631    0.299
  9   1157656   1202118    0.667    0.250
 10   1179821   1292903    0.684    0.283
 11   1224562   1269969    0.700    0.283
 12   1247190   1316007    0.732    0.265
 13   1269976   1339294    0.750    0.283
 14   1292918   1339277    0.816    0.282
 15   1316022   1362703    0.765    0.415
 16   1339283   1386292    0.581    0.350
 17   1362709   1410035    0.598    0.383
 18   1386295   1410042    0.782    0.299
 19   1410040   1433943    0.848    0.266
 20   1433942   1458000    0.848    0.016







Table 4.8 presents the dynamics of $p_t = (p_{t,1}, \ldots, p_{t,14})$ for $n = 14$, $m = 7$,
$c = 5$, $Z \sim U(1, 5)$ and $N = 3n = 42$.



Table 4.8. Dynamics of $p_t$ for $n = 14$, $m = 7$, $c = 5$, $Z \sim U(1, 5)$ and $N = 3n = 42$.

 t   $p_t$
 0   1.00 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50
 1   1.00 0.25 0.75 0.50 0.75 0.75 0.50 0.75 0.00 0.25 0.75 0.25 0.25 0.50
 2   1.00 0.75 0.75 0.75 1.00 1.00 1.00 0.25 0.00 0.00 0.25 0.00 0.00 0.00
 3   1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
 4   1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
 5   1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00



4.9.2 r-Partition 



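Here we extend our numerical results to the r-partition problem ($r > 2$), by
considering a graph $G = (V, E)$ with the following symmetric cost matrix:

$$C = \begin{pmatrix}
Z_{11} & C_{12} & \cdots & C_{1r} \\
C_{21} & Z_{22} & \cdots & C_{2r} \\
\vdots & \vdots & \ddots & \vdots \\
C_{r1} & C_{r2} & \cdots & Z_{rr}
\end{pmatrix}. \qquad (4.53)$$

We choose the submatrices $\{Z_{ii}\}$ and $\{C_{ij}\}$ such that the optimal partition
is given by $V^* = \{V_1^*, \ldots, V_r^*\}$ with

$$\begin{aligned}
V_1^* &= \{1, \ldots, m_1\}, \\
V_2^* &= \{m_1 + 1, \ldots, m_1 + m_2\}, \\
&\ \,\vdots \\
V_r^* &= \Bigl\{\sum_{i=1}^{r-1} m_i + 1, \ldots, n\Bigr\}.
\end{aligned}$$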







To ensure this, we let all elements of the matrices $\{Z_{ii}\}$ be less than all the
elements of the matrices $\{C_{ij}\}$. For simplicity, we assume that all elements of
the $C_{ij}$ are equal to some constant $c$, that $n$ is divisible by $r$, and that the
optimal partition divides $V$ into $r$ subsets of equal size $m = n/r$. In this case
the optimal value $\gamma^*$ is given by

$$\gamma^* = c \binom{r}{2} m^2 = \frac{c\,(r - 1)\,n^2}{2r}.$$

We generated the elements of the $Z_{ii}$ via a random distribution with support
on a subset of $[0, b)$, for $b$ small enough. Tables 4.9 and 4.10 below present
the evolution of Algorithm 4.5.2 for the r-partition problems with configurations
$n = 400$, $r = 4$ and $n = 500$, $r = 5$, respectively. We set $c = 50$ and
generate the elements of the $Z_{ii}$ via the $U(4.5, 5)$-distribution. Note that we
only need to generate the elements above the diagonal, since $C$ is a symmetric
matrix and the diagonal elements of the $Z_{ii}$ are 0.
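In the equal-size case, counting the r(r - 1)/2 pairs of blocks, each joined by m^2 edges of weight c, suggests an optimal value of c r(r - 1) m^2 / 2; this agrees with the values 3000000 and 5000000 quoted for Tables 4.9 and 4.10. The sketch below checks the count on a toy instance by enumerating all labelings. The function names are hypothetical and the enumeration does not enforce equal block sizes.

```python
import random
from itertools import product


def r_partition_matrix(n, r, c, a, b, seed=0):
    """Blocks of size m = n // r: U(a, b) weights inside a block, weight c between blocks."""
    m = n // r
    rng = random.Random(seed)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            C[i][j] = C[j][i] = rng.uniform(a, b) if i // m == j // m else c
    return C


def partition_value(labels, C):
    """Total weight of edges whose end points carry different labels."""
    n = len(labels)
    return sum(C[i][j] for i in range(n) for j in range(i + 1, n)
               if labels[i] != labels[j])


def optimal_value_formula(n, r, c):
    m = n // r
    return c * (r * (r - 1) // 2) * m * m


n, r, c = 6, 3, 50
C = r_partition_matrix(n, r, c, a=4.5, b=5.0)
planted = [i // (n // r) for i in range(n)]          # labels 0,0,1,1,2,2
brute = max(partition_value(lab, C) for lab in product(range(r), repeat=n))
print(partition_value(planted, C), brute, optimal_value_formula(n, r, c))
```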



Table 4.9. Evolution of Algorithm 4.5.2 for the r-partition problem with $n = 400$
nodes, $r = 4$, $\gamma^* = 3000000$, $\rho = 0.01$, $N = 10000$, $c = 50$, and $Z \sim U(4.5, 5)$. CPU
time is 2097 seconds, $\varepsilon = 0$.

  t   $\hat{\gamma}_t$     $S_{t,(N)}$     $p_t^{\min}$   $p_t^{\max}$
  1   2.330840e+06   2.338761e+06   0.259990   0.240017
  2   2.330704e+06   2.338354e+06   0.259960   0.240041
  3   2.332089e+06   2.341081e+06   0.260016   0.230034
  4   2.337166e+06   2.348442e+06   0.259969   0.239941
  5   2.349762e+06   2.363804e+06   0.270121   0.230098
  6   2.372085e+06   2.389760e+06   0.279922   0.229938
  7   2.404083e+06   2.435340e+06   0.289990   0.209948
  8   2.445606e+06   2.473909e+06   0.300106   0.209884
  9   2.493293e+06   2.528921e+06   0.329936   0.179832
 10   2.550295e+06   2.574981e+06   0.299978   0.189747
 11   2.611895e+06   2.642337e+06   0.310163   0.150062
 12   2.678021e+06   2.713952e+06   0.350111   0.120135
 13   2.743622e+06   2.780953e+06   0.460127   0.119892
 14   2.806880e+06   2.841077e+06   0.490120   0.079962
 15   2.867101e+06   2.895819e+06   0.479967   0.059933
 16   2.918022e+06   2.950955e+06   0.669903   0.030030
 17   2.960013e+06   2.977818e+06   0.790066   0.019997
 18   2.990928e+06   3.000000e+06   0.939996   0.009996
 19   3.000000e+06   3.000000e+06   1.000000   0.000000
 20   3.000000e+06   3.000000e+06   1.000000   0.000000
 21   3.000000e+06   3.000000e+06   1.000000   0.000000







Table 4.10. Evolution of Algorithm 4.5.2 for the r-partition problem with $n = 500$
nodes, $r = 5$, $\gamma^* = 5000000$, $\rho = 0.01$, $N = 20000$, $c = 50$, and $Z \sim U(4.5, 5)$. CPU
time 9557 seconds, $\varepsilon = 0.00091$.

  t   $\hat{\gamma}_t$     $S_{t,(N)}$     $p_t^{\min}$   $p_t^{\max}$
  1   4.105605e+06   4.112836e+06   0.205009   0.194994
  2   4.105501e+06   4.115174e+06   0.205005   0.194993
  3   4.105646e+06   4.114378e+06   0.209992   0.189999
  4   4.106427e+06   4.113426e+06   0.215033   0.184996
  8   4.163015e+06   4.189215e+06   0.215004   0.179973
  9   4.197580e+06   4.222975e+06   0.205025   0.185003
 10   4.239794e+06   4.268014e+06   0.230000   0.150036
 11   4.286359e+06   4.318551e+06   0.244958   0.149990
 14   4.446925e+06   4.485072e+06   0.245036   0.139987
 15   4.503736e+06   4.541756e+06   0.259982   0.135044
 16   4.558667e+06   4.590573e+06   0.300069   0.104966
 17   4.610398e+06   4.638465e+06   0.335027   0.090008
 18   4.658267e+06   4.693343e+06   0.349934   0.094989
 19   4.699643e+06   4.716176e+06   0.334942   0.054996
 23   4.812588e+06   4.832686e+06   0.494989   0.010006
 24   4.848739e+06   4.869789e+06   0.500067   0.005003
 25   4.887795e+06   4.911411e+06   0.494996   0.005000
 26   4.929371e+06   4.957022e+06   0.579969   0.000000
 27   4.965013e+06   4.982266e+06   0.714995   0.000000
 28   4.986607e+06   4.995472e+06   0.630120   0.000000
 29   4.995472e+06   4.995472e+06   0.995000   0.000000
 30   4.995472e+06   4.995472e+06   0.995000   0.000000
 31   4.995469e+06   4.995469e+06   0.995000   0.000000
 32   4.995469e+06   4.995469e+06   0.995000   0.000000
 33   4.995469e+06   4.995469e+06   1.000000   0.000000
 34   4.995469e+06   4.995469e+06   1.000000   0.000000



4.9.3 Multiple Solutions 

To investigate the convergence behavior of Algorithm 4.5.2 when the max-cut
problem has multiple solutions, we ran a modified version of the synthetic
max-cut problem. In this problem we select, deterministically or at random,
$k$ vertices from both $V_1$ and $V_2$ and set the weights of all edges that are
incident to these nodes equal to $c$. We obtain in this way $\binom{2k}{k}$ different cuts
with the same value of the objective function.
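The degeneracy can be observed directly on a toy instance. The sketch below is an illustration under stated assumptions: the function names are hypothetical, integer weights are used so that ties are exact, the selected vertices are simply the first k of each block, and k is kept below m/2. With n = 10, m = 5 and k = 2 the enumeration finds 6 optimal cuts, the number of ways of choosing 2 out of the 4 selected vertices; likewise 70 and 252 are the numbers of ways of choosing 4 out of 8 and 5 out of 10.

```python
import random
from itertools import product
from math import comb


def degenerate_maxcut_matrix(n, m, k, c, z_max, seed=0):
    """Synthetic max-cut matrix in which k nodes per block have all incident weights c."""
    rng = random.Random(seed)
    special = set(range(k)) | set(range(m, m + k))
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same_block = (i < m) == (j < m)
            if i in special or j in special or not same_block:
                w = c
            else:
                w = rng.randint(0, z_max)      # within-block weight, below c
            C[i][j] = C[j][i] = w
    return C


def count_optimal_cuts(C):
    """Enumerate cuts with node 0 fixed on side 1 and count the maximizers."""
    n = len(C)
    best, count = -1, 0
    for tail in product((0, 1), repeat=n - 1):
        x = (1,) + tail
        val = sum(C[i][j] for i in range(n) for j in range(i + 1, n) if x[i] != x[j])
        if val > best:
            best, count = val, 1
        elif val == best:
            count += 1
    return best, count


C = degenerate_maxcut_matrix(n=10, m=5, k=2, c=10, z_max=4)
best, count = count_optimal_cuts(C)
print(best, count, comb(4, 2))
```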

Tables 4.11-4.12 present the evolution of Algorithm 4.5.2 for a bipartition
problem with 70 and 252 multiple solutions, respectively. The problem instances
were generated from the same symmetric matrix (4.51) corresponding
to a single optimal solution, but using the above modification with edges of
equal weights. We set $n = 300$, $c = 50$, $Z \sim U(4.5, 5.0)$, and $m = 150$.



Table 4.11. Evolution of Algorithm 4.5.2 with 70 multiple solutions. CPU time 956
seconds, $\varepsilon = 0$.

  t   $\hat{\gamma}_t$     $S_{t,(N)}$     $p_t^{\min}$   $p_t^{\max}$
  1   653664.2       665978.9       0.500024   0.499776
  2   656880.4       670476.3       0.500003   0.499974
  3   677162.1       690689.7       0.500017   0.499891
 14   802374.3       821960.2       0.500893   0.498803
 15   806397.3       834234.2       0.500335   0.499993
 16   813988.4       855626.5       0.500083   0.499733
 17   822138.5       846929.7       0.500276   0.499090
 28   873720.7       906628.7       0.501069   0.499286
 29   873559.5       911553.8       0.501170   0.499558
 30   878122.9       901982.4       0.500080   0.467794
 31   882951.0       936865.4       0.531554   0.499926
 42   942160.0       968903.0       0.566601   0.433407
 43   947523.2       968873.2       0.500092   0.499161
 44   947539.0       968891.8       0.501017   0.499897
 45   947540.4       979567.9       0.532519   0.467096
 56   996443.3       1.019435e+06   0.567372   0.499518
 57   1.002278e+06   1.049345e+06   0.566760   0.466034
 58   1.007850e+06   1.031200e+06   0.533685   0.466328
 59   1.013776e+06   1.037309e+06   0.566563   0.499329
 69   1.111974e+06   1.118444e+06   0.599846   0.466461
 70   1.118439e+06   1.125000e+06   0.666536   0.466771
 71   1.118439e+06   1.125000e+06   0.500000   0.399687
 72   1.118441e+06   1.125000e+06   0.566504   0.499707
 73   1.118441e+06   1.125000e+06   0.533463   0.499415
 74   1.118441e+06   1.125000e+06   0.599805   0.466601
 75   1.118441e+06   1.125000e+06   0.599883   0.399922
 76   1.118441e+06   1.125000e+06   0.599727   0.333138







Table 4.12. Evolution of Algorithm 4.5.2 with 252 multiple solutions. CPU time
896 seconds, $\varepsilon = 0$.

  t   $\hat{\gamma}_t$     $S_{t,(N)}$     $p_t^{\min}$   $p_t^{\max}$
  1   659664.5       671071.7       0.500081   0.499985
  2   660285.0       671062.4       0.500001   0.499990
  3   659391.4       667080.0       0.500036   0.499938
 14   797840.1       832764.9       0.531047   0.499741
 15   808965.4       841014.6       0.533229   0.499118
 16   820559.0       849247.3       0.500247   0.499749
 17   824540.7       845068.6       0.500101   0.466964
 28   880209.3       918209.4       0.500678   0.499172
 29   884915.8       903727.5       0.532711   0.499987
 30   889705.4       923102.4       0.532683   0.499826
 31   889709.6       923114.4       0.532786   0.497970
 42   953961.2       997393.5       0.500081   0.499887
 43   964531.6       991684.7       0.533474   0.466147
 44   969800.5       1.002924e+06   0.531970   0.432870
 45   970066.6       1.014588e+06   0.532278   0.499721
 56   1.068346e+06   1.092966e+06   0.799371   0.432514
 57   1.086535e+06   1.099478e+06   0.733773   0.299485
 58   1.099478e+06   1.112058e+06   0.699533   0.033486
 59   1.105727e+06   1.112064e+06   0.533270   0.000000
 60   1.112058e+06   1.118485e+06   0.599769   0.433359
 65   1.112062e+06   1.118488e+06   0.533679   0.299942
 66   1.118483e+06   1.125000e+06   0.566751   0.233288
 67   1.118486e+06   1.125000e+06   0.633476   0.200116
 68   1.118486e+06   1.125000e+06   0.666279   0.166667
 69   1.118486e+06   1.125000e+06   0.666473   0.166861
 70   1.118486e+06   1.125000e+06   0.666602   0.166505
 71   1.118486e+06   1.125000e+06   0.566150   0.100000
 72   1.118486e+06   1.125000e+06   0.666667   0.333333



We see in Tables 4.11 and 4.12 that, although in both cases the optimal
value $1.125 \cdot 10^6$ is found, the algorithm takes many iterations to converge and
seems to oscillate between multiple optimal solutions, as illustrated by the
fluctuating behavior of the vector $p$. It is interesting to note that eventually
Algorithm 4.5.2 settles down in one of the optimal degenerate solutions,
provided we increase the parameter $d$ in the stopping rule (4.25). To illustrate
this, Table 4.13 presents the evolution of Algorithm 4.5.2 for a bipartition
problem with $n = 20$ nodes that has 20 different optimal solutions. We take
$m = 10$, $\rho = 0.03$, $N = 200$, $c = 50$, $Z \sim U(4.5, 5.0)$ and increase $d$ until
the degeneracy of $p$ occurs. In this particular example we found that the
degeneracy of $p$ occurs for $d = 13$.



Table 4.13. Evolution of Algorithm 4.5.2 for $d = 13$. CPU time 0.2 seconds, $\varepsilon = 0$.

  t   $\hat{\gamma}_t$   $S_{t,(N)}$   $p_t^{\min}$   $p_t^{\max}$
  1   4591.810   4774.041   0.500792   0.497569
  2   4548.082   4774.165   0.500033   0.499967
  3   4411.859   4773.491   0.657805   0.490885
  4   4773.580   4774.041   0.666662   0.499999
  5   5000.000   5000.000   0.666667   0.333333
  6   5000.000   5000.000   0.666667   0.333333
  7   5000.000   5000.000   0.666667   0.333333
  8   5000.000   5000.000   0.666667   0.333333
  9   5000.000   5000.000   0.666667   0.333333
 10   5000.000   5000.000   0.666667   0.166667
 11   5000.000   5000.000   0.666667   0.000000
 12   5000.000   5000.000   0.833333   0.000000
 13   5000.000   5000.000   0.833333   0.166667
 14   5000.000   5000.000   1.000000   0.000000
 15   5000.000   5000.000   1.000000   0.000000
 16   5000.000   5000.000   1.000000   0.000000
 17   5000.000   5000.000   1.000000   0.000000
 18   5000.000   5000.000   1.000000   0.000000



Table 4.14 presents the dynamics of pt for the same data as in Table 4.13. 
Figures 4.6 and 4.7 present the bar charts based on data of Table 4.14. 



Table 4.14. Dynamics of pt for the same data as in Table 4.13. 



t I El 



0 


1.0 


0.5 


0.5 


0.5 


0.5 


0.5 


0.5 


0.5 


0.5 


0.5 


0.5 


0.5 


0.5 


0.5 


0.5 


0.5 


0.5 


0.5 


0.5 


0.5 


1 


1.0 


0.7 


0.5 


0.7 


0.5 


0.5 


0.7 


0.7 


0.5 


0.3 


0.5 


0.7 


0.3 


0.5 


0.7 


0.5 


0.5 


0.3 


0.5 


0.2 


2 


1.0 


0.5 


0.3 


0.5 


0.7 


0.5 


0.7 


0.3 


0.8 


0.5 


0.5 


0.3 


0.5 


0.7 


0.7 


0.5 


0.7 


0.5 


0.5 


0.0 


3 


1.0 


0.8 


0.8 


0.8 


0.7 


0.8 


0.7 


0.0 


0.7 


0.2 


0.2 


0.3 


0.7 


0.8 


0.7 


0.5 


0.3 


0.3 


0.2 


0.0 


4 


1.0 


1.0 


1.0 


0.8 


1.0 


0.7 


1.0 


0.0 


0.5 


0.7 


0.3 


0.5 


0.8 


0.5 


0.0 


0.3 


0.0 


0.0 


0.0 


0.0 


5 


1.0 


1.0 


1.0 


1.0 


1.0 


1.0 


0.7 


0.0 


0.7 


0.3 


0.3 


0.7 


0.8 


0.5 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


6 


1.0 


1.0 


1.0 


1.0 


1.0 


1.0 


0.7 


0.0 


0.8 


0.3 


0.3 


0.7 


0.8 


0.3 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


7 


1.0 


1.0 


1.0 


1.0 


1.0 


1.0 


0.5 


0.0 


0.8 


0.7 


0.3 


0.7 


0.7 


0.3 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


8 


1.0 


1.0 


1.0 


1.0 


1.0 


1.0 


0.5 


0.0 


0.5 


0.5 


0.8 


0.7 


0.7 


0.3 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


9 


1.0 


1.0 


1.0 


1.0 


1.0 


1.0 


0.5 


0.0 


0.7 


0.5 


1.0 


0.8 


0.2 


0.3 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


10 


1.0 


1.0 


1.0 


1.0 


1.0 


1.0 


0.7 


0.0 


0.2 


0.8 


0.7 


1.0 


0.2 


0.5 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


11 


1.0 


1.0 


1.0 


1.0 


1.0 


1.0 


0.7 


0.0 


0.0 


1.0 


0.8 


1.0 


0.0 


0.5 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


12 


1.0 


1.0 


1.0 


1.0 


1.0 


1.0 


1.0 


0.0 


0.0 


0.8 


0.8 


0.8 


0.0 


0.5 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


13 


1.0 


1.0 


1.0 


1.0 


1.0 


1.0 


1.0 


0.0 


0.0 


0.8 


1.0 


1.0 


0.0 


0.2 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


14 


1.0 


1.0 


1.0 


1.0 


1.0 


1.0 


1.0 


0.0 


0.0 


1.0 


1.0 


1.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 









Fig. 4.7. The charts of pt as function of t (part b). 



It can be readily seen that Algorithm 4.5.2 oscillates between iterations 5
and 13 from one optimal solution to another, but after iteration 13 it settles
down in one of the true 20-dimensional (degenerate) solutions. The convergence
properties of Algorithm 4.5.2 in the presence of multiple solutions are still
an open problem.

4.9.4 Empirical Computational Complexity 

Let us finally discuss the computational complexity of Algorithm 4.5.2 for the
max-cut and the partition problems, which can be defined as

$$\kappa_n = T_n (N_n G_n + U_n). \qquad (4.55)$$

Here $T_n$ is the total number of iterations needed before Algorithm 4.5.2 stops;
$N_n$ is the sample size, that is, the total number of maximal cuts and partitions
generated at each iteration; $G_n$ is the cost of generating the random Bernoulli
vectors of size $n$ for Algorithm 4.5.2; $U_n = O(N_n n^2)$ is the cost of updating
the tuple $(\hat{\gamma}_t, p_t)$. The latter follows from the fact that computing $S(X)$ in
(4.21) is an $O(n^2)$ operation.

For the model in (4.51) we found empirically that $T_n = O(\ln n)$, provided
$100 \le n \le 1000$. For the max-cut problem, considering that we take $n \le N_n \le 10n$
and that $G_n$ is $O(n)$, we obtain $\kappa_n = O(n^3 \ln n)$. In our experiments, the
complexity we observed was more like

$$\kappa_n = O(n \ln n).$$

The partition problem has similar computational characteristics. It is important
to note that these empirical complexity results are solely for the model
with the cost matrix (4.51).
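The cubic-times-logarithmic order of the theoretical bound can be traced by substituting the individual orders into (4.55). The lines below are only a sketch of that substitution, taking $N_n = O(n)$ as the working assumption; they say nothing about the lower order observed in the experiments.

```latex
\begin{align*}
\kappa_n &= T_n \,(N_n G_n + U_n) \\
         &= O(\ln n)\,\bigl( O(n) \cdot O(n) + O(n \cdot n^2) \bigr) \\
         &= O(\ln n) \cdot O(n^3) \;=\; O(n^3 \ln n).
\end{align*}
```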

4.10 Numerical Results for SENs 

Below we present the results of numerical experiments involving CE Algorithm
4.7.3 for the TSP and QAP. We note that various numerical results for
the TSP have already been given in Section 2.5.2, including several benchmark
TSP examples. The notation in the tables will be similar to that for
SNN experiments in Table 4.3. If not stated otherwise we set the parameters
as usual (see also Remark 4.4): $\rho = 10^{-2}$, $\alpha = 0.7$, $N = 5n^2$, and use the
stopping rule (4.25) with the parameter $d = 5$. To indicate the convergence
of $P_t$ to the ODTM $P^*$ we introduce the following quantities

$$p_t^{mm} = \min_{1 \le i \le n}\ \max_{1 \le j \le n}\ p_{t,ij}, \qquad (4.56)$$

$t = 1, 2, \ldots$, which corresponds to the min max value of the elements of the
matrix $P_t = (p_{t,ij})$ at iteration $t$, and, similarly, the max min value

$$p_{t,mm} = \max_{1 \le i \le n}\ \min_{1 \le j \le n}\ p_{t,ij}. \qquad (4.57)$$

Observe that $p_t^{mm} = 1$ if and only if $P_t = P^*$. In the tables $S_{t,(1)}$ denotes
the smallest order statistic of the performances in iteration $t$; in other words,
the length of the shortest tour obtained in iteration $t$.
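The two convergence indicators are straightforward to compute from a transition matrix. The sketch below is illustrative only: the function names and the example matrices are hypothetical, and the max min value skips the diagonal, which is an assumption made here because the diagonal of a TSP transition matrix is zero.

```python
def p_min_max(P):
    """Min over rows of the largest entry in each row, cf. (4.56)."""
    return min(max(row) for row in P)


def p_max_min(P):
    """Max over rows of the smallest off-diagonal entry in each row, cf. (4.57)."""
    n = len(P)
    return max(min(P[i][j] for j in range(n) if j != i) for i in range(n))


# A non-degenerate transition matrix (rows sum to one, zero diagonal).
P = [[0.0, 0.5, 0.5],
     [0.9, 0.0, 0.1],
     [0.2, 0.8, 0.0]]

# A degenerate matrix encoding the tour 1 -> 2 -> 3 -> 1.
P_star = [[0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0],
          [1.0, 0.0, 0.0]]

print(p_min_max(P), p_max_min(P))
print(p_min_max(P_star), p_max_min(P_star))
```

For a degenerate matrix every row contains a single 1, so the min max value reaches 1 while the max min value drops to 0, which is the pattern seen in the last rows of the tables.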



4.10.1 Synthetic TSP 

Since the TSP is an NP-hard problem, no efficient exact methods are available
to verify the accuracy of our method for large $n$. Total enumeration of all the
paths is suitable only for small networks, say with $n = 10$ cities containing
$9! = 362{,}880$ trajectories. To still assess the accuracy and speed of the CE
algorithm for a large TSP we construct the following artificial problem for
which the solution is available in advance: Let the cost matrix $C$ be such that
$c_{i,i+1} = 1$, for all $i = 1, 2, \ldots, n - 1$ and $c_{n,1} = 1$, while the remaining elements
are $c_{ij} \sim U(a, b)$, $j \ne i + 1$, $b > a$, $a > 1$ and $c_{ii} = 0$.

It follows that in this case the optimal permutation/tour is given by $x^* =
(1, 2, 3, \ldots, n)$, corresponding to the sequence of visits $1 \to 2 \to 3 \to \cdots \to
n \to 1$. The corresponding minimal value of the tour is

$$\gamma^* = \sum_{i=1}^{n-1} c_{i,i+1} + c_{n,1} = n. \qquad (4.58)$$
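The planted optimum can be confirmed by total enumeration when n is tiny. The following sketch is an illustration with hypothetical function names and 0-based city indices; it uses a = 2 and b = 10 so that every random entry is strictly larger than 1.

```python
import random
from itertools import permutations


def synthetic_tsp_matrix(n, a, b, seed=0):
    """Cost 1 on the edges i -> i+1 (cyclically), U(a, b) elsewhere, 0 on the diagonal."""
    rng = random.Random(seed)
    C = [[0.0 if i == j else rng.uniform(a, b) for j in range(n)] for i in range(n)]
    for i in range(n):
        C[i][(i + 1) % n] = 1.0
    return C


def tour_length(tour, C):
    n = len(tour)
    return sum(C[tour[k]][tour[(k + 1) % n]] for k in range(n))


def brute_force_tsp(C):
    """Enumerate the (n-1)! tours that start in city 0."""
    n = len(C)
    best_tour, best_len = None, float("inf")
    for rest in permutations(range(1, n)):
        tour = (0,) + rest
        length = tour_length(tour, C)
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len


n = 7
C = synthetic_tsp_matrix(n, a=2.0, b=10.0)
best_tour, best_len = brute_force_tsp(C)
print(best_tour, best_len)
```

Any other tour uses at least one random edge of cost at least 2, so its length is at least n + 1; this is the reason the planted tour is the unique optimum in this sketch.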

Table 4.15 presents a typical evolution of the CE Algorithm 4.7.3 for $n =
30$. The relative experimental error of the final solution was $\varepsilon = 0.0019$. The
algorithm was implemented in Matlab and took approximately 45 seconds to
converge on a 1.4 GHz processor with 256 MB of RAM.



Table 4.15. A typical evolution of Algorithm 4.7.3 for the synthetic TSP with
$n = 30$, $d = 5$, $\rho = 0.01$, $N = 4500$, and $\alpha = 0.7$. $\varepsilon = 0.0019$.

  t   $\hat{\gamma}_t$   $S_{t,(1)}$   $p_t^{mm}$
  1   40.73   39.12   0.0722
  2   38.19   36.83   0.0994
  3   36.09   34.58   0.1218
  4   34.63   32.92   0.1268
  5   33.64   32.37   0.1719
  6   32.89   31.80   0.1493
  7   32.11   31.17   0.2208
  8   31.54   30.74   0.3162
  9   31.22   30.13   0.2704
 10   30.97   30.39   0.2519
 11   30.79   30.14   0.3211
 12   30.68   30.00   0.2498
 13   30.29   30.00   0.2754



The dynamics of this particular problem including an illustration of the 
convergence can be found in Section 2.5.2. 







4.10.2 Multiple Solutions 

We consider the behavior of Algorithm 4.7.3 for the TSP in the presence
of multiple solutions, similar to Section 4.9.3 for the max-cut problem. In
order to generate multiple solutions we first generate our synthetic model of
Section 4.10.1. We then select deterministically or at random $m$ different tours
and assign to all edges in each of these tours the cost 1. As a consequence, in addition
to the tour $1 \to 2 \to 3 \to \cdots \to n \to 1$ there will be $m$ other tours that have
the minimal length $\gamma^* = n$. Performing simulation studies for this model we
found that Algorithm 4.7.3 in most cases finds one of the optimal solutions
$\gamma^*$ in a finite number of iterations.

Clearly, in the case of multiple solutions one needs to use the stopping rule
associated with the level $\gamma^*$ rather than the stopping rule associated with the
probability matrix $P_t$. The convergence proof of Algorithm 4.7.3 in this case
is still an open problem.

Tables 4.16-4.18 present the evolution of the CE algorithm for a multi-extremal
TSP with 40 nodes and 10, 25, and 50 optimal solutions, respectively.



Table 4.16. Multi-extremal synthetic TSP, $c_{ij} \sim U(1.0, 10.0)$, 40 nodes, $\gamma^* = 40$,
$\rho = 0.01$, $N = 8000$, $\alpha = 0.90$, 10 optimal tours. CPU time 58 seconds, $\varepsilon = 0.000$.

  t   $\hat{\gamma}_t$   $p_t^{mm}$   $p_{t,mm}$   $S_{t,(1)}$
  1   139.0758   0.058043   0.002564   113.7909
  2   103.6459   0.085165   0.000256   81.79360
  3   77.32870   0.098468   0.000026   58.98756
  4   61.26692   0.122118   0.000003   49.89649
  5   52.36412   0.128535   0.000000   45.38383
  6   47.44017   0.165937   0.000000   42.24399
  7   44.65781   0.151682   0.000000   40.92816
  8   42.94660   0.180555   0.000000   40.77399
  9   42.22027   0.156351   0.000000   40.00000
 10   41.54227   0.159557   0.000000   40.00000
 11   41.07858   0.154678   0.000000   40.08350
 12   40.89977   0.145107   0.000000   40.00000
 13   40.82557   0.147619   0.000000   40.00000
 14   40.64345   0.166954   0.000000   40.00000
 15   40.62949   0.149387   0.000000   40.00000
 16   40.49546   0.144624   0.000000   40.00000
 17   40.38665   0.138180   0.000000   40.00000







Table 4.17. Multi-extremal synthetic TSP, $c_{ij} \sim U(1.0, 10.0)$, 40 nodes, $\gamma^* = 40$,
$\rho = 0.01$, $N = 8000$, $\alpha = 0.90$, 25 optimal tours. CPU time 31 seconds, $\varepsilon = 0.000$.

  t   $\hat{\gamma}_t$   $p_t^{mm}$   $p_{t,mm}$   $S_{t,(1)}$
  1   95.86915   0.049209   0.002564   71.78275
  2   65.69504   0.082939   0.000256   53.80846
  3   48.94862   0.084247   0.000026   44.08474
  4   42.11559   0.101557   0.000003   40.00000
  5   40.11133   0.099725   0.000000   40.00000
  6   40.00000   0.112419   0.000000   40.00000
  7   40.00000   0.112296   0.000000   40.00000
  8   40.00000   0.103212   0.000000   40.00000
  9   40.00000   0.108986   0.000000   40.00000



Table 4.18. Multi-extremal synthetic TSP, $c_{ij} \sim U(1.0, 10.0)$, 40 nodes, $\gamma^* = 40$,
$\rho = 0.01$, $N = 8000$, $\alpha = 0.90$, 50 optimal tours. CPU time 24 seconds, $\varepsilon = 0.000$.

  t   $\hat{\gamma}_t$   $p_t^{mm}$   $p_{t,mm}$   $S_{t,(1)}$
  1   59.92559   0.050516   0.002564   45.73649
  2   43.07207   0.070143   0.000256   40.00000
  3   40.00000   0.073504   0.000026   40.00000
  4   40.00000   0.075902   0.000003   40.00000
  5   40.00000   0.097586   0.000000   40.00000
  6   40.00000   0.099463   0.000000   40.00000
  7   40.00000   0.097126   0.000000   40.00000



It is interesting to note that before Algorithm 4.7.3 settles down into a
particular $\gamma^*$ it oscillates for a number of iterations from one $P^*$ to another
$P^*$, although this does not appear to result in an increase in the number of
iterations or in CPU time. Moreover, the larger the subset of optimal solutions, the
smaller the effort needed to find one of them.



4.10.3 Experiments with Sparse Graphs 

Up to now we considered experiments with fully connected graphs. Yet many
problems are defined on cost matrices that are not fully connected and even
sparse (many entries are set to $\infty$).

Next, we present numerical results for such problems where the original
cost matrix generation is slightly modified to produce “sparse” random cost
matrices. We denote by $\rho_{\mathrm{cost}}$ the density of a cost matrix, that is, the fraction
of active edges relative to the total number of edges of a fully connected matrix.
To obtain a TSP case with $\rho_{\mathrm{cost}} < 1$ we first generate our basic cost matrix
$C$ as described in Section 4.10.1. Next, for a given $\rho_{\mathrm{cost}} < 1.0$ we replace at
random $(1.0 - \rho_{\mathrm{cost}}) \times 100\%$ of the generated $c_{ij}$ with infinities. Thus we obtain
a cost matrix with density $\rho_{\mathrm{cost}}$. Tables 4.19-4.21 present the evolution of the
CE algorithm for the deterministic TSP with $\rho_{\mathrm{cost}} < 1.0$.
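The sparsification step might be sketched as follows. This is one possible reading rather than the procedure actually used: the function name is hypothetical, only the randomly generated entries are eligible for removal (so the planted unit-cost tour survives), and the removed fraction is therefore taken relative to those entries rather than to all edges.

```python
import math
import random


def sparsify_cost_matrix(C, density, seed=0):
    """Replace a fraction (1 - density) of the random entries of C by infinity."""
    rng = random.Random(seed)
    n = len(C)
    # Eligible entries: off-diagonal and not on the planted tour i -> i+1.
    candidates = [(i, j) for i in range(n) for j in range(n)
                  if i != j and j != (i + 1) % n]
    n_remove = round((1.0 - density) * len(candidates))
    sparse = [row[:] for row in C]            # leave the input matrix untouched
    for i, j in rng.sample(candidates, n_remove):
        sparse[i][j] = math.inf
    return sparse


# Small synthetic matrix in the spirit of Section 4.10.1 (0-based cities).
n = 6
gen = random.Random(1)
C = [[0.0 if i == j else gen.uniform(1.0, 10.0) for j in range(n)] for i in range(n)]
for i in range(n):
    C[i][(i + 1) % n] = 1.0

S = sparsify_cost_matrix(C, density=0.6)
n_inf = sum(math.isinf(v) for row in S for v in row)
tour_len = sum(S[i][(i + 1) % n] for i in range(n))
print(n_inf, tour_len)
```

Keeping the planted tour finite preserves the known optimal value n, which is what allows the relative error to be reported for the sparse instances.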







Table 4.19. Synthetic TSP, $c_{ij} \sim U(1.0, 10.0)$, 30 nodes, $\gamma^* = 30$, $\varrho = 0.01$, $N = 4500$, $\alpha = 0.90$, $\varrho_{\mathrm{cost}} = 0.60$. CPU time 20 seconds, $\varepsilon = 0.000$.

  t    $\hat\gamma_t$   $P_t^{mm}$   $P_{t,mm}$   $S_{t,(1)}$
  1    730.3994         0.079062     0.003448     440.3003
  2    345.7853         0.107328     0.000345     134.3467
  3    142.7576         0.114207     0.000034     115.4077
  4    116.8807         0.148245     0.000003     94.46148
  5    96.78283         0.202041     0.000000     78.22054
  6    83.67821         0.221003     0.000000     69.11775
  7    73.48995         0.202565     0.000000     57.08287
  8    65.36932         0.209768     0.000000     50.82449
  9    57.85579         0.290023     0.000000     43.66945
  10   51.21792         0.292054     0.000000     41.83288
  11   46.08596         0.310255     0.000000     38.35064
  12   40.20582         0.386022     0.000000     33.35134
  13   35.56522         0.410951     0.000000     30.00000
  14   30.00000         0.921320     0.000000     30.00000
  15   30.00000         0.992132     0.000000     30.00000
  16   30.00000         0.999213     0.000000     30.00000
  17   30.00000         0.999921     0.000000     30.00000
  18   30.00000         0.999992     0.000000     30.00000



Table 4.20. Synthetic TSP, $c_{ij} \sim U(1.0, 10.0)$, 40 nodes, $\gamma^* = 40$, $\varrho = 0.01$, $N = 8000$, $\alpha = 0.90$, $\varrho_{\mathrm{cost}} = 0.50$. CPU time 84 seconds, $\varepsilon = 0.000$.

  t    $\hat\gamma_t$   $P_t^{mm}$   $P_{t,mm}$   $S_{t,(1)}$
  1    1432.600         0.048886     0.002564     1083.389
  2    850.0822         0.071265     0.000256     481.6481
  3    453.7181         0.099729     0.000026     276.6252
  4    204.5736         0.094775     0.000003     167.9254
  5    171.2220         0.120172     0.000000     144.3944
  6    148.9051         0.137320     0.000000     123.5032
  11   91.11670         0.182737     0.000000     75.15271
  12   85.11291         0.208025     0.000000     63.81974
  13   78.54369         0.225161     0.000000     60.25669
  14   73.96963         0.214676     0.000000     55.54310
  19   47.37309         0.505358     0.000000     40.00000
  20   42.05268         0.499078     0.000000     40.00000
  21   40.00000         0.936332     0.000000     40.00000
  22   40.00000         0.993633     0.000000     40.00000
  23   40.00000         0.999363     0.000000     40.00000
  24   40.00000         0.999936     0.000000     40.00000







Table 4.21. Synthetic TSP, $c_{ij} \sim U(1.0, 10.0)$, 40 nodes, $\gamma^* = 40$, $\varrho = 0.01$, $N = 8000$, $\alpha = 0.90$, $\varrho_{\mathrm{cost}} = 0.30$. CPU time 98 seconds, $\varepsilon = 0.000$.

  t    $\hat\gamma_t$   $P_t^{mm}$   $P_{t,mm}$   $S_{t,(1)}$
  1    2115.702         0.058218     0.002564     1630.689
  2    1456.354         0.093351     0.000256     1092.379
  3    957.4554         0.109995     0.000026     567.8778
  4    570.2860         0.124202     0.000003     304.9498
  5    365.9991         0.141733     0.000000     178.9949
  6    261.9922         0.139430     0.000000     158.5351
  7    182.4486         0.128288     0.000000     138.5457
  8    168.1792         0.130095     0.000000     135.2442
  9    155.0441         0.136634     0.000000     122.6527
  10   147.4868         0.164139     0.000000     118.4093
  11   137.9609         0.156376     0.000000     106.2428
  12   132.3045         0.171164     0.000000     99.46488
  13   127.3059         0.187035     0.000000     100.9402
  14   122.5778         0.160873     0.000000     96.32555
  15   117.6652         0.149974     0.000000     87.01289
  16   112.2542         0.155534     0.000000     76.07462
  17   104.4893         0.217362     0.000000     80.15214
  18   99.34178         0.221301     0.000000     78.23600
  19   94.08737         0.173795     0.000000     65.02988
  20   88.06668         0.208399     0.000000     61.89260
  21   80.64057         0.197207     0.000000     58.10715
  22   75.45361         0.276194     0.000000     47.93998
  23   68.30934         0.359635     0.000000     40.00000
  24   55.84366         0.616132     0.000000     40.00000
  25   40.00000         0.961613     0.000000     40.00000
  26   40.00000         0.996161     0.000000     40.00000
  27   40.00000         0.999616     0.000000     40.00000
  28   40.00000         0.999962     0.000000     40.00000



4.10.4 Numerical Results for the QAP 

Table 4.22 presents the performance of Algorithm 4.7.3 for N = 15 for a
number of quadratic assignment problem case studies, taken from the URL
www.imm.dtu.dk/~sk/qaplib. Here we use the same notation as in Table
2.5 for the TSP, with the exception that $\hat\gamma_1$ and $\hat\gamma_T$ denote the best of the
10 solutions after the first and the final iteration. Note that for $n = 26$ we
obtained a solution better than the best known one.







Table 4.22. Case studies for QAP.

  file     n    T     $\hat\gamma_1$   $\hat\gamma_T$   $\gamma^*$   $\varepsilon$             $\varepsilon*$   CPU
  had12    12   33    1772             1666             1652         0.012     0.018     0.008            4.1
  had14    14   37    2964             2743             2724         0.014     0.019     0.007            6.5
  had16    16   79    4020             3736             3720         0.006     0.009     0.004            15.2
  had18    18   112   5748             5412             5358         0.016     0.023     0.010            46.3
  had20    20   94    7474             7022             6920         0.020     0.027     0.014            81.8
  bur26a   26   172   5547428          5346546          5426670      -0.011    0.002     -0.015           240.5
  tho30    30   64    157932           153998           149936       0.045     0.060     0.027            610



4.11 Appendices 



4.11.1 Two Tour Generation Algorithms for the TSP

Algorithm 4.7.1 specifies how we can generate a random tour/permutation
$X = (X_1, \ldots, X_n)$ for the TSP. Here is an alternative, but equivalent,
representation (see Section 4.4, in particular the first paragraph, and [145]). Define
$X_{ij}$ ($i, j \in \{1, \ldots, n\}$) as:

$$X_{ij} = \begin{cases} 1, & \text{if edge } (i,j) \text{ belongs to the current tour,} \\ 0, & \text{otherwise.} \end{cases} \tag{4.59}$$

Obviously, if $X_{ij^*} = 1$, we can conclude that $X_{ij} = 0$ for $j \neq j^*$. The
following algorithm, due to Margolin [113], gives a detailed implementation
of Algorithm 4.7.1 when tours are represented by the collection $X = \{X_{ij}\}$.
Let the probability $p_{ij}$ correspond to a transition from node $i$ to node $j$.

Algorithm 4.11.1 (Node Transition Algorithm)

1. Define the probability matrix $P$:

$$P = \begin{pmatrix} 0 & p_{12} & \cdots & p_{1n} \\ p_{21} & 0 & \cdots & p_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1} & p_{n2} & \cdots & 0 \end{pmatrix} \tag{4.60}$$

(Assume that all $p_{ij} \neq 0$ for $i \neq j$.)

2. Set counter = 1, $U = \{1\}$, $i = 1$.

3. Generate the next node $J$ from the following distribution:

$$\mathbb{P}(J = j) = \begin{cases} p_{i1}, & j = 1, \\ p_{i2}, & j = 2, \\ \;\vdots & \\ p_{in}, & j = n. \end{cases}$$

Assume that $J$ has received value $j^*$. Define

$$X_{ij} = \begin{cases} 1, & j = j^*, \\ 0, & j \neq j^*. \end{cases}$$

4. Set $U = U \cup \{j^*\}$, counter = counter + 1, $i = j^*$.

5. If (counter < n − 1) update the $i$-th row of the matrix:

    sum = 0
    for j = 1 to n
        if (j ≠ i) and (j ∈ U)
            sum = sum + p_ij
            p_ij = 0
        end if
    end for
    for j = 1 to n
        p_ij = p_ij / (1 − sum)
    end for

Go to Step 3.

6. Set $\{j^*\} = V \setminus U$ (at this step $U$ contains exactly $n-1$ nodes), $X_{ij^*} = 1$.
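
A minimal Python sketch of node-transition tour generation follows. It is an illustration, not the authors' implementation: the name `node_transition_tour` is hypothetical, nodes are numbered from 0, the last node is drawn from the one-point distribution that remains (equivalent to Step 6), and the closing edge back to the start node is added as an assumption so that $X$ describes a complete tour.

```python
import numpy as np

def node_transition_tour(P, rng):
    """Generate a tour that starts at node 0 by successive node transitions.

    P is an n x n matrix of transition probabilities with zero diagonal and
    positive off-diagonal entries. Returns the visiting order and the 0/1
    edge-indicator matrix X."""
    n = P.shape[0]
    X = np.zeros((n, n), dtype=int)
    visited = [0]                        # the set U, kept in visiting order
    i = 0
    while len(visited) < n:
        row = P[i].astype(float)         # astype returns a copy, P is untouched
        row[visited] = 0.0               # forbid nodes that are already in U
        row /= row.sum()                 # renormalize the remaining probabilities
        j = int(rng.choice(n, p=row))    # draw the next node
        X[i, j] = 1
        visited.append(j)
        i = j
    X[i, 0] = 1                          # assumption: close the tour at the start node
    return visited, X

rng = np.random.default_rng(0)
n = 6
P = (np.ones((n, n)) - np.eye(n)) / (n - 1)   # uniform initial transition matrix
tour, X = node_transition_tour(P, rng)
print(tour)
```

Working on a copy of each row keeps the original matrix intact, so the same $P$ can be reused for every trajectory in a CE iteration.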

Recall that we can generate trajectories either via node transitions
(Algorithms 4.7.1 and 4.11.1) or via node placements, as in Algorithm 4.7.2.
Similar to Algorithm 4.11.1 we next present a more detailed description of
Algorithm 4.7.2, using the following representation: Define

$$Y_{ij} = \begin{cases} 1, & \text{if node } i \text{ is placed at position } j \text{ in the current tour,} \\ 0, & \text{otherwise.} \end{cases} \tag{4.61}$$

Obviously, if $Y_{ij^*} = 1$, then $Y_{ij} = 0$ (for $j \neq j^*$). Let the probability $p_{(i,j)}$
correspond to the following event: node $i$ stands in the $j$-th place in the
permutation.

Algorithm 4.11.2 (Node Placement Algorithm)

1. Define the probability matrix $P$:

$$P = \begin{pmatrix} p_{(1,1)} & p_{(1,2)} & \cdots & p_{(1,n)} \\ p_{(2,1)} & p_{(2,2)} & \cdots & p_{(2,n)} \\ \vdots & \vdots & \ddots & \vdots \\ p_{(n,1)} & p_{(n,2)} & \cdots & p_{(n,n)} \end{pmatrix} \tag{4.62}$$

(Assume that all $p_{(i,j)} \neq 0$.)

2. Set $U = \emptyset$, $i = 1$.

3. Generate the random variable $J$ from the following distribution:

$$\mathbb{P}(J = j) = \begin{cases} p_{(i,1)}, & j = 1, \\ p_{(i,2)}, & j = 2, \\ p_{(i,3)}, & j = 3, \\ \;\vdots & \\ p_{(i,n)}, & j = n. \end{cases}$$

Assume that $J$ has received value $j^*$. Define

$$Y_{ij} = \begin{cases} 1, & j = j^*, \\ 0, & j \neq j^*. \end{cases}$$

4. Set $U = U \cup \{j^*\}$, $i = i + 1$.

5. Update the $i$-th row of the matrix:

    sum = 0
    for j = 1 to n
        if (j ∈ U)
            sum = sum + p_(i,j)
            p_(i,j) = 0
        end if
    end for
    for j = 1 to n
        p_(i,j) = p_(i,j) / (1 − sum)
    end for

If $i < n$ go to Step 3.

6. Set $\{j^*\} = V \setminus U$ (at this step $U$ contains exactly $n-1$ nodes), $Y_{nj^*} = 1$.

As noted in Remark 4.12 we can use the alias method to speed up trajectory
generation. In this case an initial setup and extra storage are required,
but since the alias method is very fast it is preferred to the inverse-transform
and the acceptance-rejection methods for large dimensions. For the TSP and
problems with a similar solution generation we have used a combination of
the alias, acceptance-rejection, and inverse-transform methods as described
below.

Note that our goal is to generate a permutation of $n$ numbers using the
probability matrix $P$. For simplicity consider Algorithm 4.11.2, that is, the
$i$-th number in a permutation is generated using the $i$-th row from $P$. We
propose to generate the first $x\%$ of the numbers using a combination of the
alias method and the acceptance-rejection method. The remaining numbers
are generated using the inverse-transform method. Let Alias$(i, P)$ denote a
random number generated by the alias method using the $i$-th row from $P$.
This leads to the following modification of Algorithm 4.11.2.

Algorithm 4.11.3 (Faster Trajectory Generation via Alias)

    U = ∅
    for i = 1 to [xn]
        repeat
            j* = Alias(i, P)
        until j* ∉ U
        Y_ij* = 1
        U = U ∪ {j*}
    end for
    for i = [xn] + 1 to n − 1
        Update the i-th row of P as in Step 5 of Algorithm 4.11.2
        Generate j* as in Step 3 of Algorithm 4.11.2
        Y_ij* = 1
        U = U ∪ {j*}
    end for
    Determine the final j* from {j*} = V \ U. Set Y_nj* = 1.

One should not take $x$ too large, because for large values of $x$ (that is, close
to 100%), the algorithm will, as it progresses, generate too many rejections for
each acceptance. On the other hand, since the alias method is much faster than
the inverse-transform method, $x$ should not be too small either. It has been
found numerically that the above algorithm performs best with $x$ in the range 70-80%.
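
The combination of alias sampling, rejection, and inverse transform can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the alias tables are built with Vose's variant of the alias method, $x$ is passed as a fraction (0.75 instead of 75%), and all entries of $P$ are assumed to be strictly positive so that the rejection loop terminates.

```python
import numpy as np

def build_alias(p):
    """Build alias tables (prob, alias) for a probability vector p (Vose's variant)."""
    n = len(p)
    scaled = np.asarray(p, dtype=float) * n
    prob, alias = np.ones(n), np.arange(n)
    small = [k for k in range(n) if scaled[k] < 1.0]
    large = [k for k in range(n) if scaled[k] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l      # column s: keep s w.p. scaled[s], else take l
        scaled[l] -= 1.0 - scaled[s]          # l donated the remainder of column s
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias                        # leftover columns keep prob = 1

def alias_draw(prob, alias, rng):
    """Draw one index in constant time from tables made by build_alias."""
    k = int(rng.integers(len(prob)))
    return k if rng.random() < prob[k] else int(alias[k])

def fast_permutation(P, x, rng):
    """Fill the first [x*n] positions by alias + rejection, the rest by
    inverse transform on the row restricted to the unused numbers."""
    n = P.shape[0]
    tables = [build_alias(row) for row in P]  # setup cost; reusable across trajectories
    used = np.zeros(n, dtype=bool)
    perm = []
    cut = int(x * n)
    for i in range(cut):
        j = alias_draw(*tables[i], rng)
        while used[j]:                        # reject numbers that are already placed
            j = alias_draw(*tables[i], rng)
        used[j] = True
        perm.append(j)
    for i in range(cut, n):
        cdf = np.cumsum(np.where(used, 0.0, P[i]))   # unnormalized cdf over unused numbers
        j = int(np.searchsorted(cdf, rng.random() * cdf[-1], side="right"))
        used[j] = True
        perm.append(j)
    return perm

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(8), size=8)         # 8 x 8 matrix with positive rows summing to 1
print(fast_permutation(P, 0.75, rng))
```

In a real run the alias tables would be built once per CE iteration and shared by all $N$ trajectories, which is where the speed-up comes from.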

4.11.2 Speeding up Trajectory Generation 

Notice that each trajectory of Algorithms 4.7.1 and 4.7.2 requires $n-1$
draws from an $n$-point discrete pdf, corresponding to the rows of the
matrices $P^{(k)}$. This is clearly time consuming. To overcome this difficulty we
next present a fast and simple online procedure for trajectory generation
for SENs, having in mind the TSP. This procedure employs the well-known
composition technique; see Section 1.7.3.

Consider Algorithm 4.7.1. Assume as before that $P = (p_{ij})$ is given and
that after $k$ transitions we have arrived at some city (state) $X_k$ ($k < n-1$). Let
$X_1, \ldots, X_{k-1}$ be the previously selected cities. In Algorithm 4.7.1 we select
the $(k+1)$-st city $X_{k+1}$ by sampling from the $n$-point distribution formed by
the $X_k$-th row of matrix $P^{(k)}$, which is simply the $X_k$-th row of the original
$P$ with all probabilities corresponding to the cities $X_1, \ldots, X_k$ set to zero,
and the rest properly normalized. Let us denote this probability vector from
which we sample by $p = (p_1, \ldots, p_n)$. Our goal is to show how to sample more
efficiently from $p$ using the new approach.





First, as observed before, $k$ cities have been “marked,” corresponding to
the already generated part of the tour. We shall call the associated $k$ and $n-k$
elements of $p$ the passive and active elements of $p$.

The main idea of the new approach is to

1. Divide the unmarked cities into three groups, denoted $X(j)$, $j = 1,2,3$,
according to the size of their probabilities: each $p_i$ with $i \in X(1)$ lies below a
lower threshold, each $p_i$ with $i \in X(2)$ lies between the lower and an upper
threshold, and each $p_i$ with $i \in X(3)$ exceeds the upper threshold.

2. Approximate the probabilities for groups $X(1)$ and $X(2)$ by making them
equal within each group, while the elements of the third group (with larger
values) remain untouched.

3. Generate the desired city using a simple pdf $\phi(x)$ (see (4.64)), which is
based on the approximated distributions.

As we shall see below, such a procedure substantially speeds up the trajectory
generation while affecting the accuracy of the CE algorithm very little (the only
error comes from the approximation of the probabilities of the cities in the first
two groups).

For $j = 1,2,3$, let $\tau_j$ denote the number of elements in each group of
unmarked cities $X(j)$. We then proceed as follows.

• Calculate

$$q_j = \sum_{i \in X(j)} p_i, \quad j = 1,2,3. \tag{4.63}$$

Note that since the probabilities for the marked cities are 0, we have that
$q_1 + q_2 + q_3 = 1$.

• Associate the ordered elements in $X(1)$ with $\{1, \ldots, \tau_1\}$, the ordered
elements in $X(2)$ with $\{\tau_1 + 1, \ldots, \tau_1 + \tau_2\}$, and the ordered elements in $X(3)$
with $\{\tau_1 + \tau_2 + 1, \ldots, n - k\}$.

• Define a new discrete pdf $\varphi$ on $\{1, 2, 3\}$ with $\varphi(j) = q_j$, $j = 1,2,3$.

• Define two new discrete pdfs, $h_1(x)$ and $h_2(x)$, which are uniformly
distributed on $\{1, \ldots, \tau_1\}$ and on $\{\tau_1 + 1, \ldots, \tau_1 + \tau_2\}$, respectively. These are the
pdfs of the approximated probabilities in the first and second group.

• Define a new $\tau_3$-point discrete pdf $h_3(x)$ on $\{\tau_1 + \tau_2 + 1, \ldots, n - k\}$ that is
associated with the original probabilities of the cities in the third group.

• Based on the four discrete pdfs above ($h_i(x)$, $i = 1,2,3$, and $\varphi(y)$) define
the following pdf

$$\phi(x) = \sum_{i=1}^{3} \varphi(i)\, h_i(x) = \sum_{i=1}^{3} q_i\, h_i(x). \tag{4.64}$$

The pdf $\phi(x)$ will be used to generate the next city using the composition
method; see Section 1.7.3. The associated algorithm is as follows:







Algorithm 4.11.4 (Fast Algorithm for Trajectory Generation)

1. Generate a random variate $Y$ with the outcomes 1, 2, 3 from $\varphi(y)$.

2. Given $Y = i$, generate a random variate from the corresponding pdf $h_i(x)$.
Denote the resulting outcome by $r$, $r = 1, \ldots, n - k$.

3. Let $X_{k+1}$ be the city that is matched to the outcome $r$ above.

Remark After several iterations with the CE Algorithm (4.7.3) we will
typically have that $q_3 > q_j$, $j = 1,2$, and that (for large $n$) the number of
elements $\tau_3$ in the third group will be much smaller than in the first two.
This means that we will be sampling mainly from $h_3(x)$ rather than from
$h_j(x)$, $j = 1,2$, which is beneficial since for small $\tau_3$ matching the outcome $r$
to $X_{k+1}$ is fast.

Remark Note that for $\tau_1 = \tau_2 = 0$, we obtain that $\tau_3 = n - k$ and thus
arrive at conventional sampling. When $\tau_2 = 0$ we obtain only two groups,
the second group being empty. In such a case it is again desirable to place the
small probabilities in the first group and the large ones in the third group.
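
The composition step can be illustrated with the Python sketch below. It rests on assumptions that are not in the text: the function name `composition_sample` is hypothetical, and the two thresholds `lo` and `hi` that separate the three groups are free parameters chosen by the caller. The sketch shows the logic only; it recomputes the groups on every call and is therefore not tuned for speed.

```python
import numpy as np

def composition_sample(p, lo, hi, rng):
    """Draw one city from an approximation of the probability vector p.

    Marked cities have p[i] == 0. Unmarked cities are split into three groups by
    the hypothetical thresholds lo < hi; groups 1 and 2 are sampled uniformly,
    group 3 according to its original probabilities."""
    p = np.asarray(p, dtype=float)
    active = np.flatnonzero(p > 0)                        # unmarked cities
    pa = p[active]
    groups = [active[pa < lo],                            # group 1: small probabilities
              active[(pa >= lo) & (pa <= hi)],            # group 2: intermediate
              active[pa > hi]]                            # group 3: large probabilities
    q = np.array([p[g].sum() for g in groups])            # group masses
    q /= q.sum()                                          # guard against rounding error
    y = int(rng.choice(3, p=q))                           # step 1: pick a group
    g = groups[y]
    if y < 2:
        return int(rng.choice(g))                         # uniform within groups 1 and 2
    return int(rng.choice(g, p=p[g] / p[g].sum()))        # exact within group 3

rng = np.random.default_rng(0)
p = np.array([0.0, 0.01, 0.02, 0.07, 0.1, 0.8])           # city 0 is marked
draws = [composition_sample(p, 0.03, 0.09, rng) for _ in range(5000)]
freq = np.bincount(draws, minlength=len(p)) / len(draws)
print(freq)
```

Cities in the third group keep their exact probabilities, while cities 1 and 2 in this example are each drawn with the averaged probability 0.015.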



4.11.3 An Analysis of Algorithm 4.5.2 for a Partition Problem 

In this section we show that for a special case of the partition problem (or 
the max-cut problem) the CE algorithm can be evaluated exactly. Specifically, 
we can employ the deterministic rather than the stochastic version of Algo- 
rithm 4.5.2. This can be used to examine the behavior of the deterministic 
CE algorithm, and enables us to compare the stochastic and deterministic 
versions of the algorithm. 

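Consider a fully connected network with $n = 2m$ nodes. Assume that

$$c_{ij} = \begin{cases} a, & 1 \le i, j \le m \ \text{ or } \ m+1 \le i, j \le n; \\ b, & \text{otherwise,} \end{cases} \tag{4.65}$$

where $a < b$. Thus, (4.65) defines the symmetric distance matrix

$$C = \begin{pmatrix} a & \cdots & a & b & \cdots & b \\ \vdots & & \vdots & \vdots & & \vdots \\ a & \cdots & a & b & \cdots & b \\ b & \cdots & b & a & \cdots & a \\ \vdots & & \vdots & \vdots & & \vdots \\ b & \cdots & b & a & \cdots & a \end{pmatrix}, \tag{4.66}$$

where all components of the upper left-hand and lower right-hand quadrants
equal $a$ and the remaining components equal $b$.

We want to partition the graph into two equal parts (sets), such that the
sum of distances (weights of the edges running from one subset to the other)
is maximized.

Clearly, the optimal partition is $V^* = \{V_1^*, V_2^*\}$, with $V_1^* = \{1, \ldots, m\}$. We
assume $1 \in V_1$. Moreover, by symmetry, we can assume that the distribution
of the cut vector $X = (X_1, X_2, \ldots, X_n)$ at stage $t$ of the CE algorithm is of the
form $p_t = (1, p_{t,1}, \ldots, p_{t,1}, p_{t,2}, \ldots, p_{t,2})$, so at each iteration we only have two
parameters to update.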

Denote by $A_i \subset \mathcal{X}$, $i = 0, \ldots, m-1$, the subsets of $\mathcal{X}$ of all possible
partitions obtained from the optimal partition $V^*$ by replacing $i$ nodes from
$V_1^*$ with $i$ nodes from $V_2^*$. It is readily seen that

$$\mathcal{X} = \bigcup_{i=0}^{m-1} A_i. \tag{4.67}$$

The cardinality of $A_i$ represents the total number of ways in which we can
choose $i$ nodes from $V_1^* \setminus \{1\}$ and $i$ nodes from $V_2^*$, which is given by

$$|A_i| = \binom{m-1}{i} \binom{m}{i}. \tag{4.68}$$

The cardinality of the set of all possible bipartitions is

$$|\mathcal{X}| = \sum_{i=0}^{m-1} |A_i| = \frac{n!}{2\, m!\, m!}.$$

Denote by $S_i$, $i = 0, \ldots, m-1$, the cost of a bipartition associated with
the subset $A_i$. Then, the optimal bipartition $V^*$ corresponds to the cost $S_0 =
\gamma^* = b m^2$, and

$$S_i = S_0 - (b - a)\, i\, (n - i) = b m^2 - (b - a)\, i\, (n - i). \tag{4.69}$$



We shall now calculate the deterministic sequence of triplets $\{(\gamma_t, p_{t,1}, p_{t,2})\}$
($t = 1, 2, \ldots$) using (4.11), (4.12), (4.67) and (4.68).

First, note that 



Pp.(x G . 4 ,) = x . (4.70) 

We assume that $p_0 = (1, \tfrac12, \ldots, \tfrac12)$. That is, with probability $1/|\mathcal{X}|$ we
generate any of the $|\mathcal{X}|$ possible bipartitions.

Second, for a given $p_{t-1}$ and $\varrho$, say $\varrho = 0.01$, we determine $\gamma_t$ from (4.12).
Since $S(X)$ can only take the values $S_0 > S_1 > \cdots > S_{m-1}$ on the sets
$A_0, \ldots, A_{m-1}$, it follows that $\gamma_t$ is given by

$$\gamma_t = S_{m-k},$$

where $k$ is the largest integer ($1 \le k \le m$) such that

$$\mathbb{P}_{p_{t-1}}(S(X) \ge S_{m-k}) = \sum_{i=0}^{m-k} \mathbb{P}_{p_{t-1}}(X \in A_i) \ge \varrho,$$

where $\mathbb{P}_{p_{t-1}}(X \in A_i)$ is given in (4.70). Of course $k$ can be different at each
iteration.

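Next, in the usual way we find that the deterministic updating formula
(4.12) for $p_{t,1}$ is

$$p_{t,1} = \frac{\mathbb{E}_{p_{t-1}} I_{\{S(X) \ge \gamma_t\}}\, (m-1)^{-1} \sum_{r=2}^{m} X_r}{\mathbb{E}_{p_{t-1}} I_{\{S(X) \ge \gamma_t\}}} \tag{4.71}$$

and similarly for $p_{t,2}$ (replacing $(m-1)^{-1} \sum_{r=2}^{m}$ with $m^{-1} \sum_{r=m+1}^{n}$). The
denominator of (4.71) is given by

$$\mathbb{P}_{p_{t-1}}(S(X) \ge S_{m-k}) = \sum_{i=0}^{m-k} \mathbb{P}_{p_{t-1}}(X \in A_i)$$

and the numerator by

$$\mathbb{E}_{p_{t-1}} I_{\{S(X) \ge \gamma_t\}}\, \frac{1}{m-1} \sum_{r=2}^{m} X_r = \sum_{i=0}^{m-k} \mathbb{P}_{p_{t-1}}(X \in A_i)\, \frac{m-1-i}{m-1}.$$

Substituting (4.70) turns the last sum into an explicit expression in the
components of $p_{t-1}$.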



Similar explicit expressions can be found for $p_{t,2}$. Hence, we can explicitly
calculate the sequence of triplets $\{(\gamma_t, p_{t,1}, p_{t,2})\}$ as a function of $t, n, \varrho, a, b$
and also the stopping time $T$ of Algorithm 4.5.2, as a function of $n, \varrho, a, b$.

We employed Algorithm 4.5.2 for the model (4.65)-(4.66) using both
simulation and the analytical formulas above. For the former, Algorithm 4.5.2
generated the standard stochastic sequences of tuples $\{(\hat\gamma_t, \hat p_t)\}$, while for the
latter we replaced $\{(\hat\gamma_t, \hat p_t)\}$ by the deterministic sequences of tuples $\{(\gamma_t, p_t)\}$.
We found that the theoretical and the simulation results are very close
provided that $N > 10n$ in the latter. Consider, for example, the simulation
results in Table 4.5 for the cases $Z_1 = 4$ and $Z_2 = 4.9$, which correspond to
^1(^1 ) and £2(^2). For this case we found that for $N > 10n$ the deterministic
$\{(\gamma_t, p_t)\}$ and the stochastic $\{(\hat\gamma_t, \hat p_t)\}$ sequences are, in fact, indistinguishable
in the sense that the stopping time $T$ is the same for both cases, and after
several iterations with Algorithm 4.5.2 both sequences $\{\gamma_t\}$ and $\{\hat\gamma_t\}$ differ by
no more than 10“^.
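
As a small sanity check of the counting argument, the class sizes $\binom{m-1}{i}\binom{m}{i}$ and the total $n!/(2\,m!\,m!)$ can be compared with a brute-force enumeration for small $m$. The Python sketch below is an illustration only; the function name `class_sizes_bruteforce` is hypothetical.

```python
from itertools import combinations
from math import comb, factorial

def class_sizes_bruteforce(m):
    """Enumerate all bipartitions {V1, V2} of {1, ..., 2m} into equal parts with
    node 1 in V1, and count them by i = number of nodes of {1, ..., m} missing
    from V1 (that is, the number of swapped nodes)."""
    n = 2 * m
    counts = [0] * m
    for rest in combinations(range(2, n + 1), m - 1):    # V1 = {1} together with rest
        v1 = {1, *rest}
        i = sum(1 for node in range(1, m + 1) if node not in v1)
        counts[i] += 1
    return counts

m = 4
brute = class_sizes_bruteforce(m)
formula = [comb(m - 1, i) * comb(m, i) for i in range(m)]
total = factorial(2 * m) // (2 * factorial(m) ** 2)
print(brute, formula, total)
```

For $m = 4$ both counts agree class by class, and their sum equals $8!/(2 \cdot 4! \cdot 4!) = 35$.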



4.11.4 Convergence of CE Using the Best Sample 

The purpose of this section is to provide insight into the convergence properties
of the CE algorithm. To this end we consider a very simple, but also general,
CE-like algorithm in which the updating is performed on the basis of a single
elite (best) sample. In contrast, Algorithm 4.2.1 updates the parameters on
the basis of $\lceil \varrho N \rceil$ elite samples. We investigate under what general conditions
the algorithm will converge to the optimal solution. We show that convergence
is guaranteed, provided that the probability of obtaining a better value
in the next iteration increases fast enough and in a smooth way. For alternative
convergence proofs of the CE method, see Chapter 3 and [85, 113, 145].

Let $\mathcal{X}$ be a finite set and $S(x)$ be a real-valued function on $\mathcal{X}$. We are
interested in finding the maximum of $S$ on $\mathcal{X}$, and to this end use the following
general randomized algorithm:

Algorithm 4.11.5

1. Pick a point $X_t \in \mathcal{X}$ according to a probability density $f_t$ on $\mathcal{X}$.

2. Compute the quantity $S(X_t)$ and the best found value so far of $S$:

$$S_t = \max_{1 \le r \le t} S(X_r).$$

3. Update the probability density $f_t$, thus getting a new probability density
$f_{t+1}$, replace $t$ with $t+1$, and loop.

The above description, of course, is generic: we have not specified how
the initial density $f_1$ is chosen and what the updating $f_t \mapsto f_{t+1}$ is. Our goal
is to point out general conditions under which we can guarantee the following
two desirable properties of the algorithm:

A. Convergence, with probability 1 as $t \to \infty$, of the best found value so far
$S_t$ of $S$ to the maximum value $\gamma^* = \max_{x \in \mathcal{X}} S(x)$;

B. Convergence, with probability 1 as $t \to \infty$, of the density $f_t$ to a density
supported by the set $\mathcal{X}^*$ of maxima of $S$ on $\mathcal{X}$.

A natural sufficient condition for Property A is given by the following
proposition.

Proposition 4.19. Assume that

(i) The initial density is positive everywhere on $\mathcal{X}$:

$$f_1(x) > 0 \quad \text{for all } x \in \mathcal{X};$$

(ii) The updating rules $f_t \mapsto f_{t+1}$ are “safe” in the following sense:

$$f_{t+1}(x) \ge \exp\{-a_t\}\, f_t(x) \quad \text{for all } x \in \mathcal{X},$$

where $a_t \ge 0$ are such that

$$\sum_{t=1}^{T-1} a_t \le \ln(T) + C \tag{4.72}$$

for all $T \ge 2$ and a certain $C \in (0, \infty)$.

Then,

$$\mathbb{P}(S_t = \gamma^* \text{ for all large enough values of } t) = 1. \tag{4.73}$$

Proof. Let $x^*$ be one of the maximizers of $S$ on $\mathcal{X}$. From the description of
our generic algorithm it is clear that

$$1 - \mathbb{P}(S_t = \gamma^* \text{ for all large enough values of } t) \le \mathbb{P}(X_t \neq x^* \text{ for all } t),$$

so that to prove (4.73) it suffices to verify that $\mathbb{P}(X_t \neq x^* \text{ for all } t) = 0$, or
(which is the same) that

$$\lim_{T \to \infty} \mathbb{P}(X_t \neq x^*,\ t = 1, \ldots, T) = 0. \tag{4.74}$$

Setting $\beta = f_1(x^*)$, we have

$$f_t(x^*) \ge \beta \exp\Big\{-\sum_{r=1}^{t-1} a_r\Big\} \ge \beta \exp\{-C\}\, t^{-1},$$

where the concluding inequality is implied by (4.72). Consequently

$$\begin{aligned}
\mathbb{P}(X_t \neq x^*,\ t = 1, \ldots, T) &= \prod_{t=1}^{T} \big(1 - f_t(x^*)\big) \\
&\le \prod_{t=1}^{T} \big(1 - \beta e^{-C} t^{-1}\big) \\
&\le \prod_{t=1}^{T} \exp\{-\beta e^{-C} t^{-1}\} \qquad [\text{since } \exp\{-s\} \ge 1 - s \ge 0 \text{ for } 0 \le s \le 1] \\
&= \exp\Big\{-\beta e^{-C} \sum_{t=1}^{T} t^{-1}\Big\} \to 0, \quad T \to \infty,
\end{aligned}$$

and (4.74) follows. □

Now let us give a sufficient condition for Property B. Consider a realization
$x_1, x_2, \ldots$ of the algorithm's trajectory. We call $\tau$ a record time for the sequence
$S(x_1), S(x_2), \ldots$ if $\tau = 1$, or $\tau > 1$ and $S(x_t) < S(x_\tau)$ for all $t < \tau$ (that is, at time
instant $\tau$ we pick a point which is better, in terms of $S$, than all points we
have seen before). For $t \ge 1$, let $\tau(t)$ be the last record time which is $\le t$, and
let $x^t = x_{\tau(t)}$ be the record at instant $t$ (that is, the best, in terms of $S$, among
the points $x_1, \ldots, x_t$; if there are several best points among $x_1, \ldots, x_t$, the record
$x^t$ is the first of them). Let us make the following natural assumption on the
updating rules $f_t \mapsto f_{t+1}$:

(*) Assume that a point $x \in \mathcal{X}$ becomes a record at a certain time
$T$ and remains the record till time $U \ge T$: $x = x^T = x^{T+1} = \cdots = x^U$
(and $S(x^{U+1}) > S(x)$). There exists a sequence $\{\beta_t \ge 0\}$ (perhaps
depending on $x$ and $T$ as on parameters) such that $\sum_{t=1}^{\infty} \beta_t = \infty$ and

$$1 - f_{t+1}(x^t) \le (1 - \beta_t)\big(1 - f_t(x^t)\big), \quad t = T, T+1, \ldots, U. \tag{4.75}$$

(*) says that the rule for updating the current probability density increases
the probability of the current record, and this increase is, in a sense, not too
small.

Proposition 4.20. Under condition (*), the density $f_t$ converges, with
probability 1 as $t \to \infty$, to a single-point density. If, in addition to (*), the
conditions of Proposition 4.19 hold, the limiting distribution with probability
1 corresponds to a maximizer of $S$ on $\mathcal{X}$.

Proof. Since $\mathcal{X}$ is finite, the sequence of records, along every trajectory of
the algorithm, stabilizes: starting from a time instant $T$ (depending on the
trajectory), we have $x^T = x^{T+1} = x^{T+2} = \cdots$. Let $\{\beta_t\}$ be the sequence corresponding
to $(x^T, T)$ in view of (*). Then by (*) for $t > T$ one has

$$1 - f_t(x^T) \le (1 - \beta_T) \cdots (1 - \beta_{t-1})\big(1 - f_T(x^T)\big),$$

and the left-hand side in this inequality tends to 0 as $t \to \infty$ due to $\sum_t \beta_t = \infty$.
Thus $f_t(x^T) \to 1$ as $t \to \infty$, and therefore the density $f_t$ along the trajectory
in question converges to the single-point density corresponding to the point
$x^T$, the last record in the realization. Under the conditions of Proposition 4.19,
this point, by Proposition 4.19, is, with probability 1, a maximizer of $S$ on $\mathcal{X}$.

Example 4.21 (Bernoulli Case). Consider the case where

• $\mathcal{X} = \{0, 1\}^n$ is the collection of all $2^n$ $n$-dimensional vectors $x$ with
coordinates $x_i$ taking values 0 and 1;

• $f_t$, for every $t$, is the pdf of $X = (X_1, \ldots, X_n)$ where $X$ has independent
components with $X_i \sim \mathsf{Ber}(p_{t,i})$; in other words,

$$f_t(x) = \prod_{i=1}^{n} p_{t,i}^{x_i} (1 - p_{t,i})^{1 - x_i}.$$

Let $p_t = (p_{t,1}, \ldots, p_{t,n})$. Assume, for the sake of simplicity, that $p_1$ of $f_1$ has
all coordinates equal to 1/2 and that the rules for updating $f_t \mapsto f_{t+1}$ are as
follows:

At step $t$, after the corresponding record $x^t$ is defined, we convert $p_t$
into $p_{t+1}$ (which induces the transformation $f_t \mapsto f_{t+1}$) by shifting
“in the direction of $x^t$.” Specifically, for certain given real numbers
$\beta_{t,i} > 0$, we decrease $p_{t,i}$ for $i$ with $x_i^t = 0$ and increase $p_{t,i}$ for $i$ with
$x_i^t = 1$ according to the following rules:

• In the case of $x_i^t = 0$:

$$p_{t+1,i} = (1 - \beta_{t,i})\, p_{t,i}.$$

• In the case of $x_i^t = 1$:

$$1 - p_{t+1,i} = (1 - \beta_{t,i})(1 - p_{t,i}).$$







It can be easily verified that if 

6 — £- 1 . 

— ^ min I3t,i ^ max j3t,i < — , 
t l^i^n l^i^n t 

with 0 ^ ^ ^ then the conditions of Proposition 4.19 as well as 

condition (*) are satisfied. 
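
The Bernoulli updating rule of this example is easy to simulate. The Python sketch below is a hedged illustration: the function name `best_sample_ce` is hypothetical, and the step sizes $\beta_{t,i} = c/t$ (the same for all coordinates) are an assumption made for the demonstration, not the exact condition of the example. A finite run of this kind is not guaranteed to have reached the maximizer.

```python
import numpy as np

def best_sample_ce(S, n, steps, c, seed=0):
    """Single-elite-sample scheme on {0,1}^n with the Bernoulli updating rule.
    The step size beta = c / t is a hypothetical choice for illustration."""
    rng = np.random.default_rng(seed)
    p = np.full(n, 0.5)                          # p_1: all coordinates equal to 1/2
    record, record_val = None, -np.inf
    for t in range(1, steps + 1):
        x = (rng.random(n) < p).astype(int)      # X_t with independent Ber(p_i) coordinates
        val = S(x)
        if val > record_val:                     # strict ">": the first best point stays the record
            record, record_val = x, val
        beta = c / t
        p = np.where(record == 1,
                     1.0 - (1.0 - beta) * (1.0 - p),   # record coordinate 1: increase p_i
                     (1.0 - beta) * p)                 # record coordinate 0: decrease p_i
    return record, record_val, p

record, value, p = best_sample_ce(lambda x: int(x.sum()), n=5, steps=500, c=0.2)
print(record, value, np.round(p, 3))
```

Each update moves $p_{t,i}$ a fraction $\beta_{t,i}$ of the way toward the corresponding coordinate of the record, so the parameters always stay inside $[0, 1]$.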



4.12 Exercises 

Max-Cut 

1. Consider the deterministic max-cut algorithm in Example 4.7.

a) Repeat the steps of the example, using $\varrho = 1/3$ instead of $\varrho = 0.1$.
Show that $p_t$ converges to $p^* = (1, 1, 0, 0, 0)$.

b) Repeat the same, but now with the indicator /{ 5 (x)^ 7 t} in Step 3 of 

Algorithm 4.2.2 replaced with Show that pt again 

converges to p*. Does the deterministic version of Algorithm 4.5.2 
converge with the same number of iterations as in (a)? 

c) Run Algorithm 4.2.1 on this problem and obtain a table similar to 
Table 2.3. 

TSP 

2. Consider the 4-node TSP with distance matrix

$$C = \begin{pmatrix} \infty & 1 & 2 & 3 \\ 2 & \infty & 1 & 3 \\ 2 & 1 & \infty & 3 \\ 1 & 3 & 2 & \infty \end{pmatrix}.$$

Carry out, by hand, the deterministic CE Algorithm 4.2.2 for this problem,
using node transitions (see also Algorithm 4.7.3 for more details). Take
the following transition matrix

$$P = \begin{pmatrix} 0 & p_{12} & p_{13} & p_{14} \\ p_{21} & 0 & p_{23} & p_{24} \\ p_{31} & p_{32} & 0 & p_{34} \\ p_{41} & p_{42} & p_{43} & 0 \end{pmatrix},$$

and set $\varrho = 1/3$. Start with the initial $P_0$ where all off-diagonal elements are
1/3.

3. Run Algorithm 4.2.1 on the data from the URL

http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/atsp/
and obtain a table similar to Table 2.5.







4. Choose a benchmark TSP of approximately 100 nodes. Select 15 nodes
at random, and construct a miniature TSP as described in Remark 4.13.
Find a good choice for $(\alpha, C, \varrho)$ by trial and error, and apply this to the
original problem. Check if this procedure produces better results than the
“standard” parameter choice (0.7, 5, 0.01).

5. Select a TSP of your choice. Verify the following statements about the
choice of CE parameters:

a) By reducing $\varrho$ or increasing $\alpha$, convergence becomes faster, but we can
be trapped in a local minimum.

b) When reducing $\varrho$ one needs to decrease $\alpha$ simultaneously, and vice versa,
in order to avoid convergence to a local minimum.

c) When increasing the sample size $N$ one can simultaneously reduce $\varrho$ or (and)
increase $\alpha$.

6. The longest path problem (LPP) is a celebrated problem in combinatorics.
Consider a complete graph with $n$ nodes. With each edge from node $i$ to
$j$ is associated a cost $c_{ij}$. For simplicity we assume that all costs are
nonnegative and finite. Some edges can have zero cost. The goal is to find
the longest self-avoiding path from a certain source node to a sink node.
The complexity of any known LPP algorithm grows geometrically in the
number of nodes. This is in sharp contrast with the shortest path problem
(SPP) for which efficient (i.e., polynomially bounded) algorithms exist,
such as Dijkstra’s algorithm [48] or the Bellman-Ford algorithm [21, 56],
which can also be applied to negative-weight graphs.

a) Assuming the source node is 1 and the sink node is n, formulate the 
LPP similar to the TSP in (4.32). (Note that the main difference with 
the TSP is that the vectors, or paths, in the LPP can have different 
lengths.) 

b) Specify a path generation mechanism and the corresponding updating 
rules for the CE algorithm. 

c) Verify that the ODTM P* in general does not have only ones and 
zeros, in contrast to the ODTM of the TSP. 




5 



Continuous Optimization and Modifications 



In this chapter we discuss how the CE method can be used for continuous
multi-extremal optimization, and provide a number of modifications of the
basic CE algorithm, applied to both combinatorial and continuous
multi-extremal optimization. Specific modifications include the use of alternative
“reward/loss” functions (Section 5.2), and the formulation of a fully
automated version of the CE algorithm (Section 5.3). This modified version allows
automatic tuning of all the parameters of the CE algorithm. Numerical results
for continuous multi-extremal optimization problems and the FACE
modifications are given in Sections 5.4 and 5.5, respectively.



5.1 Continuous Multi-Extremal Optimization

In this section we apply the CE Algorithm 4.2.1 to continuous multi-extremal
optimization problems. We consider optimization problems of the form
$\max_{x \in \mathcal{X}} S(x)$ or $\min_{x \in \mathcal{X}} S(x)$, where $S : \mathcal{X} \to \mathbb{R}$ is the objective function and the
problem is either unconstrained ($\mathcal{X} = \mathbb{R}^n$) or the constraints are very simple (for example,
when $\mathcal{X}$ is an $n$-dimensional rectangle $[a, b]^n$). Generation of a random vector
$X = (X_1, \ldots, X_n) \in \mathcal{X}$ in such cases is straightforward. The easiest way is to
generate the coordinates independently from an arbitrary 2-parameter
distribution, such that by applying Algorithm 4.2.1 the joint distribution converges
to the degenerate distribution at the point $x^*$ where the global extremum is
attained. Examples of such distributions are the normal, double-exponential,
and beta distributions.

Example 5.1 (Normal Updating). We wish to optimize the function $S$ given
by

$$S(x) = e^{-(x-2)^2} + 0.8\, e^{-(x+2)^2}, \quad x \in \mathbb{R}.$$

Note that $S$ has a local maximum at point $-2.00$ (approximately) and a
global maximum at $2.00$. At each stage $t$ of the CE procedure we simulate
a sample $X_1, \ldots, X_N$ from a $\mathsf{N}(\hat\mu_{t-1}, \hat\sigma^2_{t-1})$ distribution, determine $\hat\gamma_t$ in the
usual way, and update $\hat\mu_t$ and $\hat\sigma_t$ as the mean and standard deviation of all
samples $X_i$ that exceed level $\hat\gamma_t$; see (3.66) and (3.67). Note that the likelihood
ratio term is not used. A simple Matlab implementation may be found in
Appendix A.3. The CE procedure is illustrated in Figure 5.1, using starting
values $\hat\mu_0 = -6$, $\hat\sigma_0^2 = 100$ and CE parameters $\alpha = 0.7$, $\varrho = 0.1$ and $N = 100$.
The algorithm is stopped when the standard deviation becomes smaller than
0.05. We observe that the vector $(\hat\mu_t, \hat\sigma_t)$ quickly converges to the optimal
$(\mu^*, \sigma^*) = (2.00, 0)$, easily avoiding the local maximum.
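
The normal updating scheme of this example can be sketched in Python as follows. This is not the Matlab program of Appendix A.3 but a minimal illustration under stated assumptions: the objective is taken to be the two-peak function with maxima near $-2$ and $2$, the function names are hypothetical, the smoothing parameter $\alpha$ is applied to both the mean and the standard deviation, and an iteration cap is added as a safeguard.

```python
import numpy as np

def S(x):
    """Two-peak objective: local maximum near -2, global maximum near 2."""
    return np.exp(-(x - 2.0) ** 2) + 0.8 * np.exp(-(x + 2.0) ** 2)

def ce_normal(mu=-6.0, sigma=10.0, alpha=0.7, rho=0.1, N=100,
              tol=0.05, max_iter=1000, seed=0):
    """CE with normal updating in one dimension; returns (mu, sigma, iterations)."""
    rng = np.random.default_rng(seed)
    n_elite = int(round(rho * N))
    for t in range(1, max_iter + 1):
        x = rng.normal(mu, sigma, size=N)              # sample from N(mu, sigma^2)
        elite = x[np.argsort(S(x))[-n_elite:]]         # samples with the largest S values
        mu = alpha * elite.mean() + (1 - alpha) * mu   # smoothed update of the mean
        sigma = alpha * elite.std() + (1 - alpha) * sigma
        if sigma < tol:                                # stop once the density has collapsed
            break
    return mu, sigma, t

mu, sigma, iterations = ce_normal(N=1000, seed=1)
print(mu, sigma, iterations)
```

The starting value `sigma=10.0` corresponds to an initial variance of 100. The usage line takes a larger sample size than the example so that the outcome depends less on the random seed.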



Fig. 5.1. Continuous Multi-Extremal Optimization (panels: f(x) and S(x)).



The normal updating procedure above is easily adapted to the
multidimensional case. Specifically, when the components of the random vectors
$X_1, \ldots, X_N$ are chosen independently, updating of the parameters can be
done separately for each component. While applying Algorithm 4.2.1, the
mean vector $\hat\mu$ should converge to $x^*$ and the vector of standard deviations
$\hat\sigma$ to the zero vector. In short, we should obtain a degenerate pdf with all
mass concentrated in the vicinity of the point $x^*$. When using Beta$(a, b)$
random variables the updating of the parameters should be obtained from the
numerical (rather than analytical) solution of (4.8), but again, after
stopping Algorithm 4.2.1, the resulting parameter vectors $a$ and $b$ should define
a degenerate pdf in the vicinity of the point $x^*$.

Remark 5.2 (Smoothing Scheme). During the course of Algorithm 4.2.1
the sampling distribution "shrinks" to a degenerate distribution. When the
smoothing parameter $\alpha$ is large, say 0.9, this convergence to a degenerate
distribution may happen too quickly, which would "freeze" the algorithm in a
suboptimal solution. To prevent this from happening the following dynamic
smoothing scheme is suggested (considering the case where the sampling
distribution is normal): At iteration $t$ update the variance using a smoothing
parameter

$$\beta_t = \beta - \beta \left(1 - \frac{1}{t}\right)^q , \qquad (5.1)$$

where $q$ is a small integer (typically between 5 and 10), and $\beta$ is a large
smoothing constant (typically between 0.8 and 0.99). The parameter $\mu$ can be
updated in the conventional way, with constant smoothing parameter $\alpha$, say
between 0.7 and 1. By using $\beta_t$ instead of $\alpha$ the convergence to the degenerate
case has polynomial speed instead of exponential. We have used (5.1) so far
only for the Rosenbrock case; see below.
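As a small illustration of the dynamic smoothing scheme (5.1), the sketch below computes $\beta_t$ and applies it in a smoothed update. The function names `dynamic_beta` and `smoothed_update` and the default values are hypothetical, chosen only for this example.

```python
def dynamic_beta(t, beta=0.9, q=5):
    """Smoothing parameter beta_t = beta - beta * (1 - 1/t)**q for t >= 1."""
    return beta - beta * (1.0 - 1.0 / t) ** q


def smoothed_update(old, new, weight):
    """Convex combination used in smoothed updating of a parameter."""
    return weight * new + (1.0 - weight) * old


# The variance-type parameter uses the decaying weight beta_t, while the
# mean keeps a constant weight alpha (here 0.7, an illustrative value).
sigma, mu = 10.0, -6.0
for t, (elite_mean, elite_std) in enumerate([(1.5, 2.0), (1.9, 0.5)], start=1):
    mu = smoothed_update(mu, elite_mean, 0.7)
    sigma = smoothed_update(sigma, elite_std, dynamic_beta(t))
```

At $t = 1$ the weight equals $\beta$, and for large $t$ it behaves like $q\beta/t$, so later iterations change the spread of the sampling distribution more and more cautiously.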

In Section 5.4 we demonstrate the performance of Algorithm 4.2.1, where we
will consider the minimization of two well-known test functions: the
Rosenbrock function

$$S(\mathbf{x}) = \sum_{i=1}^{n-1} 100\,(x_{i+1} - x_i^2)^2 + (x_i - 1)^2 \qquad (5.2)$$

and the trigonometric function

$$S(\mathbf{x}) = 1 + \sum_{i=1}^{n} 8 \sin^2\!\big(\eta (x_i - x_i^*)^2\big) + 6 \sin^2\!\big(2\eta (x_i - x_i^*)^2\big) + \mu (x_i - x_i^*)^2 . \qquad (5.3)$$

The graphical representations of Rosenbrock's and the trigonometric
functions for $\eta = 7$, $\mu = 1$, $x_1^* = x_2^* = 0.9$ in the two-dimensional case are given
in Figures 5.2 and 5.3, respectively. It is not difficult to see that in the
$n$-dimensional case the global minima for the Rosenbrock and the trigonometric
function are attained at points $\mathbf{x}^* = (1, 1, \ldots, 1)$ and $\mathbf{x}^* = (0.9, 0.9, \ldots, 0.9)$,
respectively. The corresponding minimal function values are $S(\mathbf{x}^*) = 0$ and
$S(\mathbf{x}^*) = 1$, respectively. If not stated otherwise we assume (as in [145]) for
the trigonometric function that $\eta = 7$, $\mu = 1$.
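For experimentation, the two test functions can be written down directly. The Python sketch below follows (5.2) and (5.3) as read here; the function names and the choice of a common value `x_star` for all coordinates are assumptions of this example.

```python
import math


def rosenbrock(x):
    """Rosenbrock function (5.2); minimal value 0 at x = (1, ..., 1)."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (x[i] - 1.0) ** 2
               for i in range(len(x) - 1))


def trigonometric(x, x_star=0.9, eta=7.0, mu=1.0):
    """Trigonometric function (5.3); minimal value 1 at x = (x_star, ..., x_star)."""
    total = 1.0
    for xi in x:
        d2 = (xi - x_star) ** 2
        total += (8.0 * math.sin(eta * d2) ** 2
                  + 6.0 * math.sin(2.0 * eta * d2) ** 2
                  + mu * d2)
    return total
```

Evaluating `rosenbrock([1.0] * 10)` and `trigonometric([0.9] * 10)` gives the minimal values stated in the text, which is a quick check that the formulas were transcribed consistently.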




Fig. 5.2. Rosenbrock's function in $\mathbb{R}^2$ for $-1 \le x_i \le 1$.

Fig. 5.3. The trigonometric function in $\mathbb{R}^2$ with $\eta = 7$, $\mu = 1$, $x_1^* = x_2^* = 0.9$ and
$-1 \le x_i \le 1$.



5.2 Alternative Reward Functions 

Consider the maximization problem (4.2) where $S(\mathbf{x})$ is some positive
objective function defined on $\mathcal{X}$. In the associated stochastic problem (4.3) we
consider instead estimating the rare-event probability

$$\ell(\gamma) = \mathbb{E}_{\mathbf{u}}\, I_{\{S(\mathbf{X}) \ge \gamma\}} .$$

Note that this can be written as

$$\ell(\gamma) = \mathbb{E}_{\mathbf{u}}\, \varphi(S(\mathbf{X}); \gamma) ,$$

where $\varphi(s; \gamma)$ is the indicator $I_{\{s \ge \gamma\}}$, that is,

$$\varphi(s; \gamma) = \begin{cases} 1 & \text{if } s \ge \gamma , \\ 0 & \text{if } s < \gamma , \end{cases} \qquad (5.4)$$

for $s \ge 0$. The standard CE algorithm contains the now well-known steps
(a) updating $\hat\gamma_t$ via (4.6), and (b) updating $\hat{\mathbf{v}}_t$ via the (analytic) solution of
(4.8). A natural modification of Algorithm 4.2.1 would be to update $\hat{\mathbf{v}}_t$ using
an alternative function $\varphi(s; \gamma)$. For a maximization problem such a function
should be increasing in $s$ for each fixed $\gamma \ge 0$, and decreasing in $\gamma$ for each
fixed $s \ge 0$. In particular one could use

$$\varphi(s; \gamma) = \psi(s)\, I_{\{s \ge \gamma\}} ,$$

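for some increasing function $\psi(s)$. Using such a $\varphi(s; \gamma)$ instead of the
indicator (5.4) we now proceed as before. Specifically, the
updating step (a) of $\hat\gamma_t$ remains exactly the same, and the updating step (b)
of $\hat{\mathbf{v}}_t$ now reduces to the solution of the following program

$$\max_{\mathbf{v}} \widehat{D}(\mathbf{v}) = \max_{\mathbf{v}} \frac{1}{N} \sum_{i=1}^{N} I_{\{S(\mathbf{X}_i) \ge \hat\gamma_t\}}\, \psi(S(\mathbf{X}_i)) \ln f(\mathbf{X}_i; \mathbf{v}) . \qquad (5.5)$$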

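Numerical evidence suggests that $\psi(s) = s$ can lead to some speed-up of the
algorithm. As an example, for the max-cut problem we obtain in this case
(see (4.24)) the analytic updating formulas

$$\hat{p}_{t,i} = \frac{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \ge \hat\gamma_t\}}\, S(\mathbf{X}_k)\, I_{\{X_{ki} = 1\}}}{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \ge \hat\gamma_t\}}\, S(\mathbf{X}_k)} , \qquad i = 2, \ldots, n . \qquad (5.6)$$

However, other numerical experiments suggest that high-power polynomials,
$\psi(s) = s^\beta$ with large $\beta$, and the Boltzmann function $\psi(s) = e^{-s/\beta}$ are not
advisable, as they may lead more easily to local minima.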
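A weighted updating rule of this kind can be sketched as follows for independent Bernoulli components. This is an illustration under assumptions: the function name `weighted_bernoulli_update` is hypothetical, samples are plain 0/1 lists, and, unlike the max-cut setting where the first component is held fixed, every component is updated here.

```python
def weighted_bernoulli_update(samples, scores, gamma, psi=lambda s: s):
    """Update Bernoulli success probabilities with weights psi(S) * I{S >= gamma}."""
    weights = [psi(s) if s >= gamma else 0.0 for s in scores]
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("no positive weight above the threshold gamma")
    n = len(samples[0])
    # Weighted fraction of (elite) samples having a 1 in position i.
    return [sum(w * x[i] for w, x in zip(weights, samples)) / total
            for i in range(n)]


# Three binary vectors with performances 1, 3 and 2; threshold gamma = 2.
p_weighted = weighted_bernoulli_update(
    [[1, 0], [1, 1], [0, 1]], [1.0, 3.0, 2.0], gamma=2.0)
p_indicator = weighted_bernoulli_update(
    [[1, 0], [1, 1], [0, 1]], [1.0, 3.0, 2.0], gamma=2.0, psi=lambda s: 1.0)
```

Passing `psi=lambda s: 1.0` recovers the standard indicator-based update, which makes it easy to compare the two rules on the same sample.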



5.3 Fully Adaptive CE Algorithm 

We present a modification of Algorithm 4.2.1, called the fully adaptive (or 
automated) CE (FACE) algorithm in which the sample size is updated adap- 
tively at each iteration t of the algorithm, that is, N = Nt. In addition, this 
modification is able to identify some “difficult” and pathological problems. 
Consider a maximization problem and let

$$S_{t,(1)} \le \cdots \le S_{t,(N_t)}$$

denote the ordered sample performances of the $N_t$ samples at the $t$-th
iteration. To make notation easier, we denote $S_{t,(N_t)}$ by $S_t^*$.

The main assumption in the FACE Algorithm is that at the end of each
iteration $t$ the updating of the parameters is done on the basis of a fixed number,
$N^{\text{elite}}$ say, of the best performing samples, the so-called elite samples. Thus,
the set of elite samples $\mathcal{E}_t$ consists of those $N^{\text{elite}}$ samples in $\{\mathbf{X}_1, \ldots, \mathbf{X}_{N_t}\}$ for
which the performances $S(\mathbf{X}_1), \ldots, S(\mathbf{X}_{N_t})$ are highest. The updating Steps
2 and 3 of Algorithm 4.2.1 are modified such that

$$\hat\gamma_t = S_{t,(N_t - N^{\text{elite}} + 1)} ,$$

and

$$\hat{\mathbf{v}}_t = \operatorname*{argmax}_{\mathbf{v}} \sum_{\mathbf{X}_i \in \mathcal{E}_t} \ln f(\mathbf{X}_i; \mathbf{v}) .$$

Note that $\hat\gamma_t$ is equal to the worst sample performance of the best $N^{\text{elite}}$
sample performances, and $S_t^*$ is the best of the elite performances (indeed, of
all performances).

In the FACE algorithm the parameters $\varrho$ and $N$ of Algorithm 4.2.1 are
updated adaptively. Specifically, they are "replaced" by a single parameter: the
number of elite samples $N^{\text{elite}}$. The above updating rules are consistent with
Algorithm 4.2.1, provided we view $\varrho$ in Algorithm 4.2.1 as the parameter
which changes inversely proportional to $N_t$: $\varrho_t = N^{\text{elite}} / N_t$.

It was found experimentally that a sound choice is $N^{\text{elite}} = c_0 n$ and
$N^{\text{elite}} = c_0 n^2$ for SNNs and SENs, respectively, where $c_0$ is a fixed positive
constant (usually in the interval $0.01 < c_0 < 0.1$). The easiest way to
explain FACE is via the flow chart in Figure 5.4.

For each iteration $t$ of the FACE algorithm we design a sampling plan
which ensures with high probability that

$$S_t^* > S_{t-1}^* . \qquad (5.7)$$

Note that (5.7) implies improvement of the maximal order statistics (best
elite performance) at each iteration. To ensure (5.7) with high probability, we
allow $N_t$ to vary at each iteration $t$ in a quite wide range

$$N^{\min} \le N_t \le N^{\max} ,$$

where, say, for SNN-type problems $N^{\min} = n$ (or $N^{\min} = N^{\text{elite}}$), and
$N^{\max} = 20n$. Note that we always start each iteration by generating $N^{\min}$
samples. If at any iteration $t$, while increasing $N_t$ we obtain that $N_t = N^{\max}$
and (5.7) is violated, then we directly proceed with updating $(\hat\gamma_t, \hat{\mathbf{v}}_t)$ according
to Algorithm 4.2.1 before proceeding with the next iteration of the FACE
Algorithm. However, if FACE keeps generating samples of size $N^{\max}$ for several
iterations in turn, say, for $c = 3$ iterations, then we stop, and announce
that FACE identified a "hard" problem for which the estimate of the optimal
solution is unreliable. Parallel to (5.7) we will require at each iteration that

$$\hat\gamma_t > \hat\gamma_{t-1} . \qquad (5.8)$$

Note that (5.8) implies improvement of the worst elite sample performance $\hat\gamma_t$
at each iteration.

Fig. 5.4. The flowchart for the FACE algorithm.

Similar to Algorithm 4.2.1 we initialize by choosing some $\hat{\mathbf{v}}_0$. For example,
for the max-cut problem we choose (using $\mathbf{p}$ instead of $\mathbf{v}$), $\hat{\mathbf{p}}_0 = \mathbf{p}_0 =
(1, 1/2, \ldots, 1/2)$. We assume that the FACE parameters $N^{\min}$, $N^{\max}$, $\alpha$ and
the stopping constants $c$ and $d$ are chosen in advance. For example, for the
max-cut problem we take $N^{\min} = n$. We let $t = 1$ and $N_1 = N^{\min}$ and proceed
as follows:




Algorithm 5.3.1 (FACE Algorithm)

1. At iteration $t$, $t = 1, 2, \ldots$, take an initial sample of size $N_t$, with
$N^{\min} \le N_t \le N^{\max}$, from $f(\cdot; \hat{\mathbf{v}}_{t-1})$. Denote the corresponding ordered sample
performances by $S_{t,(1)} \le \cdots \le S_{t,(N_t)}$.

2. If (5.7) or (5.8) holds, proceed with the updating steps (4.6) and (4.8)
using the $N_t$ samples in Step 1.

3. If (5.7) and (5.8) are violated, check whether or not

$$S_{t-d}^* = \cdots = S_{t-1}^* = S_t^* . \qquad (5.9)$$

If so, stop and deliver $S_t^*$ as an estimate of the optimal solution. Call such
$S_t^*$ a reliable estimate of the optimal solution. If (5.9) does not hold, and
if $N_t < N^{\max}$, increase $N_t$ by 1, recalculate $S_t^*$ and $\hat\gamma_t$, and repeat Step 3.

4. If $N_t = N^{\max}$ and each of (5.7), (5.8) and (5.9) is violated, proceed with
(4.6) and (4.8) using the $N^{\max}$ samples mentioned in Step 1 and go to
Step 3.

5. If $N_t = N^{\max}$ for several iterations in turn, say for $c = 3$ iterations,
and each of (5.7), (5.8) and (5.9) is violated, stop and announce that
FACE identified a "hard" problem. Call $S_t^*$ an unreliable estimate of the
optimal solution.

The stopping criterion (5.9) means that the best samples in the last $d$
iterations are the same. Note that if (5.7) holds for all $t \ge 1$ we automatically
obtain that $N_t = N^{\min}$ for all $t$. In such case FACE reduces to the original
Algorithm 4.2.1. In Section 5.5 we illustrate a number of numerical experiments
with the FACE algorithm.
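The sample-size adaptation at the heart of FACE can be sketched in Python as below. This is a simplified illustration, not the authors' implementation: it covers only the growth of $N_t$ from $N^{\min}$ towards $N^{\max}$ until (5.7) or (5.8) holds, and leaves out the stopping checks (5.9) and the "hard problem" rule. The names `face_iteration`, `draw` and `score` are hypothetical.

```python
def face_iteration(draw, score, n_min, n_max, n_elite, prev_best, prev_gamma):
    """One simplified FACE-style iteration for a maximization problem.

    draw() returns one new sample; score(x) returns its performance.
    Returns (elite samples, gamma_t, best performance, N_t, improved flag).
    """
    # Start with N^min samples, kept sorted by increasing performance.
    scored = [(score(x), x) for x in (draw() for _ in range(n_min))]
    scored.sort(key=lambda pair: pair[0])
    while True:
        best = scored[-1][0]          # S*_t: best performance so far
        gamma = scored[-n_elite][0]   # worst of the n_elite best performances
        improved = best > prev_best or gamma > prev_gamma
        if improved or len(scored) >= n_max:
            elite = [x for _, x in scored[-n_elite:]]
            return elite, gamma, best, len(scored), improved
        # No improvement yet: enlarge the sample by one and re-rank.
        x = draw()
        scored.append((score(x), x))
        scored.sort(key=lambda pair: pair[0])


# Deterministic toy run: samples 1, 2, 3, ... scored by their own value.
values = iter(range(1, 100))
result = face_iteration(lambda: next(values), lambda x: x,
                        n_min=4, n_max=8, n_elite=2,
                        prev_best=5, prev_gamma=4.5)
```

In the toy run the sample grows from 4 to 6 points before the best value exceeds the previous best, at which point the two elite samples would be handed to the usual updating step.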



5.4 Numerical Results for Continuous Optimization 

We apply Algorithm 4.2.1 to the Rosenbrock and trigonometric functions of 
Section 5.1. For both cases we use the normal sampling distribution with inde- 
pendent components. For an n-dimensional function we thus need to update 
at each iteration n means and n variances. We observed that the algorithm 
finds the global optimum flawlessly, even for highly irregular cases such as the 
Rosenbrock function with n = 50. For the Rosenbrock function it is impor- 
tant that the modified smoothing scheme of Remark 5.2 is applied, in order 
to obtain convergence to the global minimum. For the trigonometric function
the standard (fixed) smoothed updating scheme suffices. The updating
of the parameters is carried out on the basis of a fixed number $N^{\text{elite}}$ of
samples. The Matlab implementation can be found in Appendix A.5.

Table 5.1 presents the evolution of Algorithm 4.2.1 for the minimization of 
the Rosenbrock function on the region [-2, 2]^^. We deal with this constraint 
by imposing a high penalty on the original Rosenbrock function outside this 







region. Specifically, for each coordinate Xi outside [—2, 2] we add 10^ di to the 
original Rosenbrock function, where $d_i$ is the distance from $x_i$ to the interval
$[-2, 2]$. We take $N = 1000$, $\alpha = 0.8$ and $N^{\text{elite}} = 20$. For the modified
smoothing scheme of (5.1) we use /3 = 0.9 and q = ^. The initial means are 
chosen independently from a uniform distribution on [-2,2]; the initial vari- 
ances are lO'^. In the table denotes the best (that is, smallest) function 
value found in iteration $t$, $\hat\gamma_t$ the worst of the elite performances, and $\hat{\boldsymbol{\mu}}_t$ the
vector of means at the $t$-th iteration. In repeated experiments the global minimum
was consistently found in less than 9 seconds on a 1.8GHz computer.
It is interesting to note that the first component always converges first, then 
the second, then third, etc. 



Table 5.1. The evolution of Algorithm 4.2.1 for the Rosenbrock function with 
n = 10. 




Table 5.2 presents the evolution of Algorithm 4.2.1 for the trigonometric
function with $\eta = 7$, $\mu = 1$ and $n = 10$. The same notation is used as in
Table 5.1. The CE parameters are exactly the same as for the Rosenbrock
case. In repeated experiments the global minimum was consistently found in
less than 2 seconds on a 1.8GHz computer. Different values for the parameters
$\mu$ and $\eta$ had little effect on the excellent accuracy of the method.




Table 5.2. The evolution of Algorithm 4.2.1 for the trigonometric function with $\eta = 7$,
$\mu = 1$ and $n = 10$.

  t   $\hat\gamma_t$   $S_t^*$    $\hat{\boldsymbol{\mu}}_t$
 10   145.890  100.990   0.29  1.06  0.42  1.36  0.99  0.32  1.51  1.69  2.32  1.13
 20    57.356   43.581   1.25  0.93  0.79  1.03  0.77  0.70  1.38  1.08  1.40  0.93
 30    44.173   28.687   1.26  0.83  1.16  0.93  0.67  0.80  0.87  1.11  1.00  1.00
 40    35.233   23.276   0.95  1.01  0.71  0.86  0.86  1.08  0.81  0.69  1.40  0.96
 50    31.960   18.373   0.77  0.94  0.77  0.97  0.88  0.93  0.70  0.77  1.02  0.79
 60    27.461   10.693   0.89  1.02  1.07  0.85  0.85  0.99  0.94  0.87  1.11  0.94
 70    21.804   10.954   0.94  0.91  0.86  0.83  0.97  0.95  0.84  0.94  0.92  0.94
 80    17.044   10.617   0.85  0.81  0.79  0.89  0.95  0.89  0.94  0.84  0.80  0.88
 90    10.247    4.550   0.89  0.88  0.89  0.89  0.85  0.90  0.90  0.90  0.87  0.89
100     6.667    2.646   0.91  0.91  0.91  0.87  0.90  0.90  0.88  0.95  0.95  0.86
110     3.459    1.706   0.92  0.87  0.92  0.91  0.85  0.90  0.88  0.89  0.90  0.92
120     2.226    1.155   0.89  0.94  0.90  0.90  0.89  0.90  0.90  0.90  0.88  0.90
130     1.808    1.137   0.92  0.90  0.90  0.90  0.90  0.88  0.89  0.91  0.90  0.90
140     1.535    1.080   0.91  0.88  0.88  0.91  0.91  0.87  0.90  0.89  0.87  0.89
150     1.310    1.053   0.90  0.91  0.90  0.90  0.90  0.91  0.88  0.92  0.89  0.89
160     1.209    1.051   0.90  0.90  0.88  0.90  0.90  0.91  0.91  0.89  0.92  0.90
170     1.115    1.007   0.89  0.89  0.90  0.91  0.90  0.90  0.90  0.89  0.89  0.91
180     1.076    1.033   0.91  0.89  0.92  0.89  0.90  0.91  0.90  0.90  0.89  0.90
190     1.046    1.010   0.90  0.89  0.90  0.91  0.89  0.90  0.90  0.90  0.90  0.91
200     1.041    1.017   0.89  0.91  0.89  0.91  0.91  0.91  0.90  0.90  0.90  0.90
210     1.026    1.004   0.90  0.90  0.90  0.90  0.90  0.91  0.89  0.90  0.89  0.91
220     1.018    1.008   0.89  0.90  0.90  0.90  0.89  0.90  0.90  0.90  0.90  0.90
230     1.013    1.004   0.90  0.90  0.91  0.90  0.89  0.90  0.90  0.89  0.90  0.90
240     1.010    1.003   0.90  0.90  0.90  0.90  0.91  0.90  0.90  0.91  0.91  0.90



Remark 5.3 (Noisy Optimization). The continuous optimization experiments
above were repeated for their noisy counterparts, that
is, when the objective functions are corrupted with noise. The CE algorithm
worked well, provided the parameters were suitably altered (e.g., by increasing
the sample size or decreasing $\beta$). Noisy optimization will be considered in
detail in Chapter 6.



5.5 Numerical Results for FACE 

Table 5.3 presents the results of the FACE Algorithm 5.3.1 applied to the
artificial max-cut problem of Table 4.6. The parameter setup is almost the
same, except for the fact that the sample size $N_t$ is variable. In particular we
set $c_0 = 0.1$, $N^{\min} = 200$, $\alpha = 0.7$. We obtained relative experimental error
$\varepsilon = 0.01$. Comparing the results of Tables 4.6 and 5.3 indicates that the FACE
algorithm behaves very similarly to the basic CE algorithm 4.5.2. But the big
advantage is that the parameters $\varrho$ and $N$ need not be specified in advance
and that the sample size $N_t$ for FACE is usually much smaller than the sample
size 1000 in Table 4.6.







Table 5.3. Evolution of the FACE algorithm for the artificial max-cut problem 
(4.51) of Table 4.6, with n = 200, m = 100, 5=1, and Z = 0. 



  t   $\hat\gamma_t$   $S_{t,(N_t)}$   $N_t$   $p_t^{\max}$   $p_t^{\min}$
  1   5050   5162    200   0.5000   0.0000
  2   5088   5234    207   0.5000   0.0000
  3   5132   5260    200   0.5000   0.0031
  4   5286   5560    200   0.5000   0.0031
  5   5510   5840    200   0.5000   0.0035
  6   5760   6254    200   0.5000   0.0015
  7   6064   6560    200   0.5000   0.0042
  8   6404   7046    200   0.5000   0.0017
  9   6768   7516    200   0.5000   0.0005
 10   7146   7800    200   0.5000   0.0138
 11   7584   8040    200   0.5000   0.0095
 12   7964   8280    200   0.5000   0.0028
 13   8432   9048    200   0.5000   0.0018
 14   8864   9140    312   0.5000   0.0522
 15   9136   9324    200   0.5000   0.0051
 16   9324   9606    200   0.5000   0.0549
 17   9606   9900    200   0.5000   0.1615
 18   9900   9900   4000   0.5000   0.3984


Table 5.4 presents the results of the FACE Algorithm 5.3.1 applied to
the artificial TSP of Table 4.15. Again, the parameter setup is the same. The
FACE parameters are $c_0 = 0.1$, $N^{\min} = 900$, $\alpha = 0.7$. We now obtained a
relative experimental error $\varepsilon = 0$, that is, the optimal solution is found, which
is an improvement on what was achieved with the basic method. We see that
roughly 80% more iterations are needed, but that most sample sizes are less
than the original 4500.

Table 5.5 illustrates the evolution of the FACE algorithm applied to the
benchmark ft53 TSP of Table 2.4.

Figure 5.5 and Table 5.6 present a more detailed comparison between the
standard and FACE algorithm for the same ft53 TSP. The results are
averaged over 10 separate runs for each algorithm. In particular, Figure 5.5
depicts the mean and the standard deviation (vertical bars) of the best
trajectory values produced by the algorithms in each iteration. In addition, we
have measured the relative experimental error

$$\varepsilon = \frac{\hat\gamma_T - \gamma^*}{\gamma^*} , \qquad (5.10)$$

where $\gamma^* = 6905$ is the best trajectory value (known to exist) and $\hat\gamma_T$ is the
resulting best trajectory value at the end of the current run; in other words
$\hat\gamma_T = S_{T,(1)}$. Another important parameter is the total number of samples



Table 5.4. Evolution of the FACE algorithm for the synthetic TSP (4.10.1) of
Table 4.15, with $n = 30$. Relative experimental error $\varepsilon = 0.0000$.

  t   $\hat\gamma_t$   $S_{t,(1)}$   $N_t$   $P_t^{mm}$
  1   42.52   38.39     900   0.0567
  2   40.88   37.61     900   0.0769
  3   39.27   36.85     900   0.0978
  4   38.02   35.37     900   0.1031
  5   36.87   34.55     900   0.1063
  6   35.87   34.38     900   0.1252
  7   35.14   33.44     900   0.1537
  8   34.40   32.68     900   0.1475
  9   33.77   32.40    1511   0.1220
 10   33.39   32.07     900   0.1517
 11   33.00   31.90     900   0.1933
 12   32.78   31.68     900   0.1762
 13   32.54   31.33     900   0.1540
 14   32.27   31.17     900   0.1953
 15   32.08   31.12     900   0.2215
 16   31.92   30.83     900   0.2432
 17   31.72   30.43    1111   0.2323
 18   31.15   30.43   18000   0.2686
 19   30.92   30.41    4949   0.2900
 20   30.88   30.32     900   0.2737
 21   30.63   30.00    3219   0.3018
 22   30.43   30.00   18000   0.3782


Table 5.5. Evolution of the FACE algorithm for the ft53 TSP of Table 2.4. Relative
experimental error $\varepsilon = 0.0284$.

  t   $\hat\gamma_t$   $S_{t,(1)}$   $N_t$   $P_t^{mm}$
  1   24528   21549    2810   0.0281
  2   23025   19778    2810   0.0361
  3   21424   18523    2810   0.0406
 18   11352    9527    2810   0.1015
 19   11054    9439    5425   0.1102
 20   10661    9076   31020   0.1252
 21   10271    9076   56200   0.1283
 22   10078    9001    8497   0.1334
 23    9989    8997    7326   0.1247
 24    9887    8904    2810   0.1395
 25    9711    8658   10219   0.1589
 26    9550    8509    2810   0.1722
 27    9423    8307    2810   0.1688
 28    9251    8229    2810   0.1727
 29    8943    8051    8288   0.1863
 30    8822    7856    2810   0.1829
 31    8680    7742    3794   0.1894
 32    8342    7732   21759   0.2088
 33    8301    7304   13943   0.2070
 34    8130    7224    7531   0.2091
 35    7743    7224   56200   0.2336
 36    7666    7209   11591   0.2514
 37    7609    7129    2810   0.2672
 38    7534    7128   24968   0.2745
 39    7392    7123    5864   0.2741
 40    7246    7119   40641   0.2518
 41    7159    7101    9080   0.6110
 42    7129    7101   56200   0.5046


produced by the algorithm before stopping. For the CE algorithm this total equals
$T \cdot N$ (with $N = 5n^2$ constant) and for the FACE algorithm it equals $\sum_{t=1}^{T} N_t$, where
$T$ is the number of iterations in a particular run and $N_t$ is the sample size
in iteration $t$. The comparison of minimum, maximum and average values of
these parameters is given in Table 5.6. These data indicate that the FACE
algorithm is only slightly better than CE. Indeed, the FACE algorithm requires
a smaller sample size in each iteration. However, the number of iterations (see
Figure 5.5) is greater.
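The two performance measures used in this comparison are simple to compute. The Python sketch below is illustrative only: the function names are hypothetical, the fixed CE sample size is assumed to be $N = 5n^2$, and the run length $T = 27$ in the usage lines is an assumed value, not one reported in the text.

```python
def relative_error(best_found, best_known):
    """Relative experimental error (5.10) for a minimization problem."""
    return (best_found - best_known) / best_known


def total_samples_ce(num_iterations, n_nodes, c=5):
    """Total samples for standard CE with fixed sample size N = c * n^2."""
    return num_iterations * c * n_nodes ** 2


def total_samples_face(sample_sizes):
    """Total samples for FACE: the sum of the per-iteration sizes N_t."""
    return sum(sample_sizes)


# ft53 (n = 53): final best value 7101 from Table 5.5 against gamma* = 6905.
eps = relative_error(7101, 6905)
# Hypothetical CE run of T = 27 iterations.
ce_total = total_samples_ce(27, 53)
# First three FACE sample sizes of a run, for illustration.
face_total = total_samples_face([900, 900, 1511])
```

Rounded to four decimals, `eps` agrees with the relative error quoted in the caption of Table 5.5.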




Fig. 5.5. Convergence of the CE and FACE algorithms applied to ft53 TSP (vertical axis: best trajectory value).



Table 5.6. Comparison of the performance of the CE and FACE algorithms applied
to ft53 TSP for 10 different runs.

                 CE                       FACE
        $\varepsilon$     total samples   $\varepsilon$     total samples
min     0.0245      379215       0.0012      243210
max     0.0468      533710       0.0647      539168
mean    0.0368      457867       0.0373      376346







5.6 Exercises 



Continuous Optimization 



1. Verify Example 5.1, for example by using the program cecont.m from
Appendix A.3. Which choice of parameters gives the fastest convergence?

2. The simplest sampling distribution for continuous optimization is the
$\mathsf{U}(a, b)$ distribution.

a) Derive the updating formulas for $a$ and $b$ from (4.8).

b) The advantage of uniform updating is that it is very fast. What is the
main disadvantage?



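3. Suppose we use the Beta$(a, b)$ distribution in the CE procedure for
continuous optimization. The updating formulas for $a$ and $b$ follow from (3.41),
that is

$$\max_{a,b} \widehat{D}(a, b) = \max_{a,b} \frac{1}{N} \sum_{i=1}^{N} I_{\{S(X_i) \ge \hat\gamma_t\}} \ln f(X_i; a, b) ,$$

where $f(\cdot; a, b)$ is the pdf of the Beta$(a, b)$ distribution.

Show that the optimal $a$ and $b$ satisfy

$$\frac{\partial \ln \Gamma(a + b)}{\partial a} - \frac{\partial \ln \Gamma(a)}{\partial a} = - \frac{\sum_{i=1}^{N} I_{\{S(X_i) \ge \hat\gamma_t\}} \ln X_i}{\sum_{i=1}^{N} I_{\{S(X_i) \ge \hat\gamma_t\}}} \qquad (5.11)$$

and

$$\frac{\partial \ln \Gamma(a + b)}{\partial b} - \frac{\partial \ln \Gamma(b)}{\partial b} = - \frac{\sum_{i=1}^{N} I_{\{S(X_i) \ge \hat\gamma_t\}} \ln (1 - X_i)}{\sum_{i=1}^{N} I_{\{S(X_i) \ge \hat\gamma_t\}}} .$$

Although there is no closed-form solution to this system of equations,
efficient algorithms such as the ellipsoid method [22] can be used to find
the solution numerically.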



4. Instead of using normal updating, apply the CE method to Example 5.1
using beta updating. Do this by modifying the program cecont.m of
Appendix A.3. You may wish to use the programs of Appendix A.6; in
that case you will also need the standard Matlab functions betapdf.m,
betarnd.m, gamrnd.m, and rndcheck.m from the statistics toolbox.

5. Run Algorithm 4.2.1 for the Rosenbrock and trigonometric functions in 
Section 5.1 and obtain tables similar to Tables 5.1 and 5.2. Use both the 
normal and the Beta (a, b) pdfs for generating sample trajectories. Observe 
the effect of the modified smoothing scheme (5.1). 

6. Consider the constrained optimization of the Rosenbrock function as in
Table 5.1; but now generate the components from a truncated normal
distribution on $[-2, 2]$, using the acceptance-rejection method. Note that
the updating formulas for the mean and variance remain the same. Why?

7. Although we typically assume that the components in a CE procedure for
continuous optimization are independent, it is possible to formulate a CE
procedure in which the components are dependent. In particular, consider
the CE algorithm where at each stage $t$ we draw $\mathbf{X}_1, \ldots, \mathbf{X}_N$ from a
multivariate normal distribution with mean vector $\boldsymbol{\mu}_{t-1}$ and covariance
matrix $\Sigma_{t-1}$.

a) Give the updating formulas for $\boldsymbol{\mu}_t$ and $\Sigma_t$ in terms of $\mathbf{X}_1, \ldots, \mathbf{X}_N$.

b) Implement this procedure and explore under which conditions the CE
method with dependent components is better than the simpler (and
faster) method with independent components.

8. Although the beta distribution is defined only on $[0, 1]$ it can be used to
optimize functions on any finite interval $[c, d]$. Explain how.

Modifications 

9. Implement a FACE algorithm for the n-queen problem of Exercise 2.6.
Compare your implementation with that of Appendix A.4.

10. Run the FACE Algorithm 5.3.1 on the data from the URL

http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/atsp/

and obtain a table similar to Table 2.5. Compare the results with those
obtained using the main CE Algorithm 4.2.1.

11. Sometimes it is not feasible, due to memory problems, to store all
vectors $\mathbf{X}_1, \ldots, \mathbf{X}_N$ at Step 2 of the CE Algorithm 4.2.1. A memory-efficient
alternative is to determine $\hat\gamma_t$ in Step 2 using a relatively small sample
size. Once $\hat\gamma_t$ is determined, Step 3 can be carried out using a new sample,
updating the parameters "online" while ignoring each $\mathbf{X}_i$ for which the
performance is less than $\hat\gamma_t$. Implement and apply this modification to a
large max-cut problem.

12*. To obtain a larger increment $\Delta\hat\gamma_t = \hat\gamma_t - \hat\gamma_{t-1}$ in the objective function
$S$ at iteration $t$ we can modify Algorithm 4.2.1 via a big-step procedure
similar to Algorithm 3.11.1.

a) Formulate a big-step modification of Algorithm 4.2.1. 

b) Implement the algorithm and apply it to a problem of your choice. 

c) Compare the results with the original CE algorithm. 




6 



Noisy Optimization with CE 



6.1 Introduction 

In this chapter we show how the CE method optimizes noisy objective func- 
tions, that is, objective functions corrupted with noise. Noisy optimization, 
also called stochastic optimization, can arise in various ways. For example, in 
the analysis of data networks [24], a well-known problem is to find a routing 
table that minimizes the congestion in the network. Typically this problem 
is solved under deterministic assumptions. In particular, the stream of data 
is represented by a constant "flow," and the congestion at a link is measured
in terms of a constant arrival rate to the link, which is assumed to depend 
on the routing table only. These assumptions are reasonable only when the 
arrival process changes very slowly relative to the average time required to 
empty the queues in the network. In a more realistic setting however, using 
random arrival and queueing processes, the optimization problem becomes 
noisy. Other examples of noisy optimization include stochastic scheduling and 
stochastic shortest/longest path problems. References on noisy optimization 
include [9, 26, 36, 37, 60, 78]. 

A classical example of noisy optimization is simulation-based optimization 
[148]. A typical instance is the buffer allocation problem (BAP). The objective 
in the BAP is to allocate n buffer spaces amongst the m — 1 “niches” (storage 
areas) between m machines in a serial production line, so as to optimize some 
performance measure, such as the steady-state throughput. This performance 
measure is typically not available analytically and thus must be estimated via 
simulation. A more detailed description of the BAP will be given in Section 6.3, 
but for general references on production lines we refer to [5, 41, 63, 116, 
155]. Buzacott and Shanthikumar [33] provide a good reference on stochastic 
modeling of manufacturing systems, while Gershwin and Schor [61] present a 
comprehensive summary paper on optimization of buffer allocation models. 

The rest of the chapter is organized as follows: In Section 6.2 we explain 
how the standard CE Algorithm 4.2.1 can be easily modified to handle general 
noisy optimization problems. In Section 6.3 we apply the noisy version of 







Algorithm 4.2.1 to the BAP. The material in this section closely follows [9]. 
In Sections 6.4 and 6.5 we provide various numerical experiments for the BAP 
and the noisy TSP, respectively. 

6.2 The CE Algorithm for Noisy Optimization 

Consider the maximization problem (4.2) and assume that the performance
function $S(\mathbf{x})$ is noisy, that is, corrupted with noise. We denote such a noisy
function as $\widehat{S}(\mathbf{x})$. We will assume that for each $\mathbf{x}$ we can readily obtain an
outcome of $\widehat{S}(\mathbf{x})$, for example via generation of some additional random vector
$\mathbf{Y}$, whose distribution may depend on $\mathbf{x}$. In that case we write symbolically
$\widehat{S}(\mathbf{x}) = S(\mathbf{x}, \mathbf{Y})$. We introduce noisy optimization via the noisy (stochastic)
TSP.

Example 6.1 (Noisy TSP). Consider the deterministic version of the TSP in
(4.32). Suppose that the matrix $(c_{ij})$, replaced now by $\mathbf{Y} = (Y_{ij})$, is random.
Thus, $Y_{ij}$, $i, j = 1, \ldots, n$ is a random variable representing the "random" time
to travel from city $i$ to city $j$. The total time of a tour $\mathbf{x} = (x_1, \ldots, x_n)$ is
given by

$$S(\mathbf{x}, \mathbf{Y}) = \sum_{i=1}^{n-1} Y_{x_i, x_{i+1}} + Y_{x_n, x_1} . \qquad (6.1)$$

Let us assume that EVij = Cij. For example, if Yj ~ f^en EYij = Cij. 

For another example, if

$$Y_{ij} = \frac{c_{ij}}{p_{ij}}\, \widetilde{Y}_{ij} ,$$

where $\widetilde{Y}_{ij} \sim \mathsf{Ber}(p_{ij})$, $(0 < p_{ij} < 1)$, then clearly again, $\mathbb{E} Y_{ij} = c_{ij}$.

We shall associate with $S(\mathbf{x}, \mathbf{Y})$ two different types of problems. The first
type deals with the random variable $S^*(\mathbf{Y}) = \max_{\mathbf{x} \in \mathcal{X}} S(\mathbf{x}, \mathbf{Y})$. For example
we may wish to estimate either

$$\mathbb{E} S^*(\mathbf{Y}) = \mathbb{E} \max_{\mathbf{x} \in \mathcal{X}} S(\mathbf{x}, \mathbf{Y}) \qquad (6.2)$$

or $\mathbb{P}(S^*(\mathbf{Y}) \ge \gamma)$, that is, to estimate the expected longest tour or the
probability that the longest tour $S^*(\mathbf{Y})$ exceeds some length $\gamma$. If $S^*(\mathbf{Y})$ can be
easily computed for each $\mathbf{Y}$, we are basically back in the situation discussed
in Chapter 3.

The second type of problem deals with the maximization problem

$$\max_{\mathbf{x}} \mathbb{E} S(\mathbf{x}, \mathbf{Y}) . \qquad (6.3)$$

Note that if either the distribution of $\mathbf{Y}$ is known, or if $S(\mathbf{x}) = \mathbb{E} S(\mathbf{x}, \mathbf{Y}) =
\mathbb{E} \widehat{S}(\mathbf{x})$ is easily computable for each $\mathbf{x}$, then (6.3) reduces to the deterministic
optimization problem (4.2). Note also that (6.3) and (6.2) are different, since
the expected value of a maximum of a random function is not equal to the
maximum of the expected value of the same random function.

For the rest of this section we deal with the maximization problem (6.3)
alone. We assume that $S(\mathbf{x}) = \mathbb{E} \widehat{S}(\mathbf{x})$ is not available, but the sample value
$\widehat{S}(\mathbf{x})$ (unbiased estimate of $\mathbb{E} \widehat{S}(\mathbf{x})$) is available, for example via simulation.
For this type of problem we now present the modified (noisy) versions of our
basic CE formulas (4.6)-(4.8). These are given for completeness only, since we
replace only $S(\mathbf{x})$ by $\widehat{S}(\mathbf{x})$, while all other data remain the same.

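1. Adaptive updating of $\gamma_t$. For a fixed $\mathbf{v}_{t-1}$, let $\gamma_t$ be a $(1 - \varrho)$-quantile
of $\widehat{S}(\mathbf{X})$ under $\mathbf{v}_{t-1}$. A simple estimator $\hat\gamma_t$ of $\gamma_t$ is

$$\hat\gamma_t = \widehat{S}_{(\lceil (1 - \varrho) N \rceil)} , \qquad (6.4)$$

where, for a random sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ from $f(\cdot; \mathbf{v}_{t-1})$, $\widehat{S}_{(i)}$ is the $i$-th
order statistic of the performances $\widehat{S}(\mathbf{X}_1), \ldots, \widehat{S}(\mathbf{X}_N)$, similar to (4.6).

2. Adaptive updating of $\mathbf{v}_t$. For fixed $\gamma_t$ and $\mathbf{v}_{t-1}$, derive $\mathbf{v}_t$ from the
solution of the CE program

$$\max_{\mathbf{v}} D(\mathbf{v}) = \max_{\mathbf{v}} \mathbb{E}_{\mathbf{v}_{t-1}} I_{\{\widehat{S}(\mathbf{X}) \ge \gamma_t\}} \ln f(\mathbf{X}; \mathbf{v}) . \qquad (6.5)$$

The stochastic counterpart of (6.5) is as follows: for fixed $\hat\gamma_t$ and $\hat{\mathbf{v}}_{t-1}$,
derive $\hat{\mathbf{v}}_t$ from the following program

$$\max_{\mathbf{v}} \widehat{D}(\mathbf{v}) = \max_{\mathbf{v}} \frac{1}{N} \sum_{i=1}^{N} I_{\{\widehat{S}(\mathbf{X}_i) \ge \hat\gamma_t\}} \ln f(\mathbf{X}_i; \mathbf{v}) . \qquad (6.6)$$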

Clearly, the main CE optimization Algorithm 4.2.1 with noisy functions
remains the same as for the deterministic one, provided again that $S(\mathbf{x})$ is
replaced by $\widehat{S}(\mathbf{x})$. It is not difficult to understand that such a noisy Algorithm
4.2.1 will work nicely as well since it will filter out the noise component $\mathbf{Y}$
from $\widehat{S}(\mathbf{x}) = S(\mathbf{x}, \mathbf{Y})$.
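To make the point concrete, the following Python sketch runs a one-dimensional CE procedure in which each sample is scored once with a noisy outcome. Everything specific here is an assumption for illustration: the normal sampling family, the quadratic test objective with additive Gaussian noise, the fixed number of iterations, and the name `noisy_ce_normal_1d`.

```python
import math
import random


def noisy_ce_normal_1d(noisy_score, mu, sigma, n=200, rho=0.1, alpha=0.7,
                       iters=30, seed=1):
    """CE with normal sampling where only noisy performances are observed."""
    rng = random.Random(seed)
    n_elite = max(2, int(round(rho * n)))
    for _ in range(iters):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        # One noisy outcome per sample; ranking uses these noisy values only.
        scored = sorted(((noisy_score(x, rng), x) for x in xs),
                        key=lambda pair: pair[0], reverse=True)
        elite = [x for _, x in scored[:n_elite]]
        m = sum(elite) / n_elite
        s = math.sqrt(sum((x - m) ** 2 for x in elite) / n_elite)
        mu = alpha * m + (1.0 - alpha) * mu
        sigma = alpha * s + (1.0 - alpha) * sigma
    return mu, sigma


def noisy_quadratic(x, rng):
    # Assumed test problem: S(x) = -(x - 2)^2 observed with N(0, 0.1^2) noise.
    return -(x - 2.0) ** 2 + rng.gauss(0.0, 0.1)


mu_final, sigma_final = noisy_ce_normal_1d(noisy_quadratic, mu=-6.0, sigma=10.0)
```

The update itself is unchanged from the deterministic case; only the scores differ. Once the sampling distribution is narrow, the ranking is driven largely by the noise, so the spread shrinks slowly, which is one reason the deterministic stopping rules need to be reconsidered for noisy problems.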

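To provide more insight into the noisy version of Algorithm 4.2.1 assume,
as before, that $\mathbb{E} \widehat{S}(\mathbf{x}) = \mathbb{E} S(\mathbf{x}, \mathbf{Y}) = S(\mathbf{x})$ for all $\mathbf{x}$ and that $S(\mathbf{x})$ has a
unique maximum $\gamma^*$ at $\mathbf{x}^*$. The first thing to observe is that $\gamma_t$ in the
deterministic and noisy cases (let us denote them by $\gamma_{1t}$ and $\gamma_{2t}$, respectively) are
generally different. Namely, $\gamma_{1t}$ is the $(1 - \varrho)$-quantile of the random variable $S(\mathbf{X})$,
with $\mathbf{X} \sim f(\cdot; \mathbf{v}_{t-1})$ and $\gamma_{2t}$ is the $(1 - \varrho)$-quantile of the random variable
$\widehat{S}(\mathbf{X})$, with $\mathbf{X} \sim f(\cdot; \mathbf{v}_{t-1})$. As a result, the sequences of estimators $\{\hat\gamma_{1t}\}$ and
$\{\hat\gamma_{2t}\}$ will generally converge to different values as well.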




Suppose next, as in (4.4), that the theoretically optimal pdf $f(\cdot; \mathbf{v}^*)$ is
the delta function with mass at $\mathbf{x}^*$. For the deterministic CE algorithm the
sequence of reference parameters $\{\hat{\mathbf{v}}_{1t}\}$ is steered towards the optimal
(degenerate) $\mathbf{v}^*$, and since $\mathbb{E} \widehat{S}(\mathbf{x}) = S(\mathbf{x})$, it is conceivable that the sequence
$\{\hat{\mathbf{v}}_{2t}\}$ also tends to $\mathbf{v}^*$. This is indeed what we have observed numerically,
although no general proof of this result is yet available. Observe also that
$\gamma^* = \mathbb{E} S(\mathbf{x}^*, \mathbf{Y})$, so one can estimate $\gamma^*$ based on the final noisy reference
parameter $\hat{\mathbf{v}}_{2T}$ as

$$\hat\gamma_T = \frac{1}{K} \sum_{i=1}^{K} \widehat{S}(\mathbf{X}_i) , \qquad (6.7)$$

where $\mathbf{X}_1, \ldots, \mathbf{X}_K$ is a random sample from $f(\cdot; \hat{\mathbf{v}}_{2T})$.

Note finally that trajectory generation for noisy problems is similar to
their deterministic counterparts. As an example consider again the noisy TSP.
Here at each iteration $t$ we first generate a sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ according to
a MCWR with transition matrix $\widehat{P}_{t-1}$ and then, independently, generate $\mathbf{Y}_k$,
$k = 1, \ldots, N$. In fact, for each $\mathbf{Y}_k$ we only need to generate the random
costs on the links traversed by the path $\mathbf{X}_k$. This produces a sequence of
tuples $\{(\hat\gamma_{2t}, \widehat{P}_t)\}$ approximating $(\gamma_2, P^*)$, where $P^*$ is the ODTM for the
deterministic case and $\gamma_2$ the $(1 - \varrho)$-quantile of $S(\mathbf{X}, \mathbf{Y})$, with $\mathbf{X} \sim P^*$.
Finally, to estimate the maximum $\gamma^* = \gamma_1$ for the deterministic case we
apply (6.7).

Stopping Rules 

The standard stopping criterion (4.10) does not apply to noisy optimization problems, since typically $\hat\gamma_t$ will not converge to a fixed value. However, we can still use alternative stopping criteria such as (4.30) and (4.43), which are applicable to both the deterministic and the noisy case and are based on the convergence of a sequence such as $\{\hat P_t\}$ to the associated degenerate case.

Another possibility is to stop the algorithm when $\{\hat\gamma_{2t}\}$ has reached stationarity. There exists an extensive literature on stationarity detection of a stochastic process. We present here two simple rules of thumb, involving moving averages of $\{\hat\gamma_{2t}\}$ and the fact that for some $t > T$ the first two moments of $\{\hat\gamma_{2t}\}$ settle down to some constant values, say $C_1$ and $C_2$, respectively, that is, $E\hat\gamma_{2t} = C_1$ and $\mathrm{Var}(\hat\gamma_{2t}) = C_2$. Our first stopping criterion is based on $E\hat\gamma_{2t} = C_1$, while our second stopping criterion is based on both $E\hat\gamma_{2t} = C_1$ and $\mathrm{Var}(\hat\gamma_{2t}) = C_2$.







First Stopping Criterion 

To identify $T$, we consider the following moving-average process

$$B_t(k) = \frac{1}{k} \sum_{s=t-k+1}^{t} \hat\gamma_{2s} , \quad t = s, s+1, \ldots, \quad s \ge k , \qquad (6.8)$$

where $k$ is fixed, say $k = 10$. Define next

$$B_t^-(k, r) = \min_{j=1,\ldots,r} B_{t+j}(k) \qquad (6.9)$$

and

$$B_t^+(k, r) = \max_{j=1,\ldots,r} B_{t+j}(k) , \qquad (6.10)$$

respectively, where $r$ is fixed, say $r = 5$. Clearly, for $t > T$ and moderate $k$ and $r$, we may expect that $B_t^+(k, r) \approx B_t^-(k, r)$, provided the variance of $\{\hat\gamma_{2t}\}$ is bounded.

With this at hand, we define our first stopping criterion as

$$T = \min\left\{ t : \frac{B_t^+(k, r) - B_t^-(k, r)}{B_t^-(k, r)} \le \varepsilon \right\} , \qquad (6.11)$$

where $k$ and $r$ are fixed and $\varepsilon$ is a small number, say $\varepsilon < 0.01$.
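The first stopping criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `moving_average` and `first_stopping_time` are hypothetical, the sequence of noisy levels is passed as a plain list whose entry `t - 1` holds the level at iteration `t`, and the moving averages are assumed to be positive so that the ratio in (6.11) is well defined.

```python
from typing import Optional, Sequence


def moving_average(gamma: Sequence[float], t: int, k: int) -> float:
    """B_t(k) of (6.8): mean of the k levels at iterations t-k+1, ..., t (1-indexed)."""
    window = gamma[t - k:t]
    return sum(window) / k


def first_stopping_time(
    gamma: Sequence[float], k: int = 10, r: int = 5, eps: float = 0.01
) -> Optional[int]:
    """Smallest t >= k satisfying (6.11), or None if no such t exists in the data.

    The decision for iteration t looks r steps ahead, so it can only be taken
    once the levels up to iteration t + r have been observed.
    """
    n = len(gamma)
    for t in range(k, n - r + 1):
        ahead = [moving_average(gamma, t + j, k) for j in range(1, r + 1)]
        b_minus, b_plus = min(ahead), max(ahead)  # (6.9) and (6.10)
        if (b_plus - b_minus) / b_minus <= eps:
            return t
    return None
```

For a sequence that climbs and then stays flat, for example `[1, 2, 3, 4, 5]` followed by twenty copies of `5`, the call `first_stopping_time(seq, k=3, r=2)` returns the first iteration whose two look-ahead moving averages agree to within 1%.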



Second Stopping Criterion 



Our second stopping criterion is the same as (6.11) with $B_t(k, r)$ replaced by

$$C_t(k) = \frac{\frac{1}{k-1} \sum_{s=t-k+1}^{t} \left[ \hat\gamma_{2s} - B_t(k) \right]^2}{B_t(k)^2} . \qquad (6.12)$$

Note that $C_t(k)$ in (6.12) and $B_t(k)$ in (6.11) represent in fact the sample SCV (squared coefficient of variation) and the sample mean of moving averages, respectively. Although the second criterion is more time consuming, it is typically more accurate than the first one.
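As a small illustration of (6.12), the sample SCV over a window of $k$ levels can be computed with the standard library. The function name `sample_scv` and the list-based indexing (entry `t - 1` holds the level at iteration `t`) are assumptions made for this sketch, which also requires $k \ge 2$.

```python
from statistics import mean, variance
from typing import Sequence


def sample_scv(gamma: Sequence[float], t: int, k: int) -> float:
    """C_t(k) of (6.12): sample variance of the last k levels divided by their squared mean."""
    window = gamma[t - k:t]
    m = mean(window)               # this is B_t(k)
    return variance(window) / m ** 2  # variance() uses the 1/(k-1) normalization
```

For the window `[4.0, 5.0, 6.0]` the sample mean is 5 and the sample variance is 1, so the SCV is 1/25.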

Once the algorithm stops at time $T$, say for the noisy TSP, we take the matrix $\hat P_T$ in Algorithm 4.7.3 as an estimate of the ODTM $P^*$. An alternative is to first approximate $\hat P_T$ by a degenerate matrix and then take the latter as an estimate of $P^*$.



6.3 Optimal Buffer Allocation in a Simulation-Based 
Environment [9] 

Here we apply the noisy version of Algorithm 4.2.1 to the buffer allocation 
problem (BAP), a well-known difficult problem in the design of production 
lines. 





The basic setting of the BAP is the following (see Figure 6.1): Consider a production line consisting of $m$ machines in series, numbered $1, 2, \ldots, m$. Jobs are processed by all machines in consecutive order. The processing time at machine $i$ has a fixed distribution with rate $\mu_i$ (hence the mean processing time is $1/\mu_i$), $i = 1, \ldots, m$. The machines are assumed to be unreliable, with exponential life and repair times. Specifically, machine $i$ has failure rate $\beta_i$ and repair rate $r_i$, $i = 1, \ldots, m$. All life, repair and processing times are assumed to be independent of each other.

The machines are separated by m — 1 storage areas, or niches, in which jobs 
can be stored. However, the total number of storage places, or buffer places, 
is limited to n. When a machine breaks down, this can have consequences for 
other machines up or down the production line. In particular, an upstream 
machine could become blocked (when it cannot pass a processed job on to 
the next machine or buffer) and a downstream machine could become starved 
(when no jobs are offered to this machine). We assume that a starved or 
blocked machine has the same failure rate as a “busy” machine. The first 
machine in the line is never starved and the last machine is never blocked. 



[Figure 6.1: machines M1, M2, M3 and M4 in series, separated by niche 1, niche 2 and niche 3.]

Fig. 6.1. A production line with m = 4 machines. The total available buffer space is n = 9. The current buffer allocation is (3,2,4). Machine 1 has an infinite supply, but is currently blocked. Machine 2 has failed and is under repair. Machine 3 is starved. Machine 4 is never blocked.








The BAP deals with the optimal allocation of $n$ buffers among $m - 1$ niches. Here “optimal” refers to some performance measure of the flow line. Typical performance measures are the steady-state throughput and the expected amount of work-in-process (WIP). We shall only deal with the steady-state throughput. Note that there are

$$\binom{n + m - 2}{m - 2}$$

possible buffer allocations.
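To get a feeling for how quickly this count grows, the binomial coefficient can be evaluated directly. The helper name `number_of_allocations` is hypothetical; the formula is the one stated above.

```python
from math import comb


def number_of_allocations(n: int, m: int) -> int:
    """Number of ways to distribute n buffer places over the m - 1 niches of an m-machine line."""
    return comb(n + m - 2, m - 2)


# Line of Figure 6.1: m = 4 machines, n = 9 buffer places.
print(number_of_allocations(9, 4))
# Case of Figure 6.2: m = 5 machines, n = 9 buffer places.
print(number_of_allocations(9, 5))
```

With $m = 4$ and $n = 9$ the count is $\binom{11}{2} = 55$; adding one more machine raises it to $\binom{12}{3} = 220$.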

There are two reasons why the BAP might be considered to be a difficult optimization problem. The first reason is that the BAP is a combinatorial optimization problem with $\binom{n+m-2}{m-2}$ elements, a number which even for moderate values of $n$ and $m$ is quite large. The second reason is that, for a given buffer allocation, the exact value of the objective function (the steady-state throughput) is often difficult or impossible to calculate analytically. In fact, the objective function is only available for relatively simple production lines in which the processing times have exponential or (simple) phase-type distributions [61, 81]. So, in a more general setting the BAP is a noisy or simulation-based optimization problem, where the objective function needs to be estimated via simulation [126, 148].

We will use the following mathematical formulation of the BAP. Each possible buffer allocation (BA) will be represented by a vector $x = (x_1, \ldots, x_{m-1})$ in the set

$$\mathcal{X} = \left\{ (x_1, \ldots, x_{m-1}) : x_i \in \{0, 1, \ldots, n\},\ i = 1, \ldots, m-1,\ \sum_{i=1}^{m-1} x_i = n \right\} .$$

Here $x_i$ represents the number of buffer spaces allocated to niche $i$, $i = 1, \ldots, m - 1$.

For each buffer allocation $x$ let $S(x)$ denote the expected steady-state throughput of the production line. Thus the BAP can be formulated as the optimization problem (6.3):

$$\text{maximize } S(x) \text{ over } x \in \mathcal{X} . \qquad (6.13)$$

As mentioned before, we assume that $S(x)$ is not available analytically and thus must be estimated via simulation. We denote by $\hat S(x)$ the estimate of $S(x)$.

In order to apply the CE algorithm we need to specify (a) how to generate 
random buffer allocations, and (b) how to update the parameters at each 
iteration. 

We present below two algorithms for random buffer allocation generation: the first is based on a multivariate discrete distribution with dependent marginals, and the second is based on generating random partitions.

First Random Buffer Allocation Algorithm 

A simple method to generate a random vector $X = (X_1, \ldots, X_{m-1})$ in $\mathcal{X}$ is to first draw $X_1, X_2, \ldots, X_{m-1}$ independently, each $X_i$ according to an $(n+1)$-point discrete distribution $(p_{i0}, \ldots, p_{in})$, $i = 1, \ldots, m - 1$, and accept the sample only if $X_1 + \cdots + X_{m-1} = n$. From Sections 4.3 and 6.2 we see immediately that the (noisy) updating formulas are given by

$$\hat p_{t,ij} = \frac{\sum_{k=1}^{N} I_{\{\hat S(X_k) \ge \hat\gamma_t\}} I_{\{X_{ki} = j\}}}{\sum_{k=1}^{N} I_{\{\hat S(X_k) \ge \hat\gamma_t\}}} , \qquad (6.14)$$

where $X_{ki} = j$ means that $j$ spaces are allocated to niche $i$ at replication (sample) $k$. Note that (6.14) is the fraction of the samples $X_k$ with an estimated performance of at least $\hat\gamma_t$ that have their $i$-th coordinate equal to $j$. For convenience we amalgamate for each $t$ the $\hat p_{t,ij}$ into a matrix $\hat P_t$. The following algorithm generates random buffer allocations according to the conditional distribution of the independent $X_i$ above, given that $\sum_i X_i = n$.

Algorithm 6.3.1 (Generation of Buffer Allocations)

For given $P = (p_{ij})$, generate a random permutation $(\pi_1, \ldots, \pi_{m-1})$ of $\{1, \ldots, m-1\}$.

Let $k = 0$
For $i = 1, \ldots, m - 1$
    Let $t = \sum_{j=0}^{n-k} p_{\pi_i j}$
    For $j = 0, \ldots, n - k$ let $p_{\pi_i j} = p_{\pi_i j} / t$
    For $j = n - k + 1, \ldots, n$ let $p_{\pi_i j} = 0$
    Generate $X_{\pi_i}$ according to $(p_{\pi_i 0}, \ldots, p_{\pi_i n})$.
    Let $k = k + X_{\pi_i}$.
End

Algorithm 6.3.1 is further illustrated in Figure 6.2. 




Fig. 6.2. Generation of the BA vector (2,4,2, 1), for the case m = 5, n = 9 and 
the permutation tt = (2,3, 1,4). For the second niche there are initially 9 possible 
buffer places; 4 buffer places are allocated. This reduces the number of available 
buffer places for the third niche to 5; 2 buffer places are allocated, etc. 
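The sequential generation scheme of Algorithm 6.3.1 can be sketched as follows. This is a minimal illustration, not the authors' Matlab code, and it rests on two assumptions that are not spelled out in the text: the last niche in the random order receives all buffer places that are still free, so that the allocation sums to exactly $n$, and a row whose admissible probabilities are all zero falls back to a uniform choice. The function name `generate_buffer_allocation` is hypothetical, and niches are indexed from 0.

```python
import random
from typing import List, Optional, Sequence


def generate_buffer_allocation(
    P: Sequence[Sequence[float]], n: int, rng: Optional[random.Random] = None
) -> List[int]:
    """Draw one buffer allocation (x_1, ..., x_{m-1}) with x_1 + ... + x_{m-1} = n.

    P has m - 1 rows; row i holds the n + 1 probabilities p_{i0}, ..., p_{in}.
    """
    rng = rng or random.Random()
    niches = len(P)
    order = list(range(niches))
    rng.shuffle(order)                      # random permutation pi of the niches
    x = [0] * niches
    k = 0                                   # buffer places allocated so far
    for position, i in enumerate(order):
        remaining = n - k
        if position == niches - 1:
            # ASSUMPTION: the last niche takes whatever is left.
            x[i] = remaining
        else:
            # Only the values 0, ..., n - k are admissible; random.choices
            # renormalizes the truncated weights, which plays the role of t.
            weights = list(P[i][: remaining + 1])
            if sum(weights) <= 0.0:
                # ASSUMPTION: uniform fallback when the truncated row has no mass.
                weights = [1.0] * (remaining + 1)
            x[i] = rng.choices(range(remaining + 1), weights=weights)[0]
        k += x[i]
    return x
```

With the uniform initial matrix, for example `P = [[1 / 10] * 10 for _ in range(4)]` and `n = 9`, every call returns a vector of four non-negative integers that sum to 9.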



To complete the algorithm, we need to specify the initialization and stopping conditions. For the initial matrix $\hat P_0$ we simply take all elements equal to $1/(n + 1)$. The stopping criterion is based on the convergence of the sequence of matrices $\hat P_0, \hat P_1, \ldots$ (see also Sections 4.2 and 4.3.2), which is found to converge to a degenerate matrix $P^*$, that is, a matrix in which each row has exactly 1 one and $n$ zeros. Specifically, the algorithm is terminated if for some integer $d$, for example, $d = 5$,

$$\mu_{t,i} = \mu_{t-1,i} = \cdots = \mu_{t-d,i} \quad \text{for all } i = 1, \ldots, m - 1 , \qquad (6.15)$$

where $\mu_{t,i}$ denotes the index of the maximal element of the $i$-th row of $\hat P_t$.
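A check of this kind of stopping condition might look as follows. The sketch assumes that the matrices of all iterations so far are kept in a list, oldest first; the names `row_argmax` and `has_converged` are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def row_argmax(P: Matrix) -> List[int]:
    """Index of the maximal element of each row of P."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in P]


def has_converged(history: Sequence[Matrix], d: int = 5) -> bool:
    """True if the row-wise argmax indices agree over the last d + 1 matrices."""
    if len(history) < d + 1:
        return False
    recent = [row_argmax(P) for P in history[-(d + 1):]]
    return all(indices == recent[0] for indices in recent)
```

Only the positions of the row maxima are compared, not the probabilities themselves, so the test can fire while the matrix is still some distance from being degenerate.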

Second Buffer Allocation Algorithm 

An alternative method for random buffer allocations is based on the bipartition problem in Section 4.6. Specifically, one can view the BAP as a “network” containing $n + m - 2$ nodes, which must be partitioned into two mutually disjoint sets, where the first and the second set contain exactly $m - 2$ and $n$ nodes, respectively. Similarly, in the urn model terminology, it could be viewed as the problem of allocating (distributing) $m - 2$ balls into $n + m - 2$ different cells. This leads to Algorithm 4.6.1 with the corresponding updating rules given in (4.24).

Consider for example the case $m = 5$ (4 niches) and $n = 9$ buffer spaces as in Figure 6.2. Any partition of $\{1, 2, \ldots, 12\}$ into two groups of 3 and 9 nodes corresponds to exactly one buffer allocation. Think of 3 balls, which we can place at any of 12 positions; the balls act as separators between consecutive niches, and the number of empty positions in each gap is the number of buffer spaces of the corresponding niche. The partition described by the binary vector (0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0) corresponds to the buffer allocation (2,4,2,1).
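The correspondence between a binary partition vector and a buffer allocation can be made concrete with a short sketch. The function name `partition_to_allocation` is hypothetical; the convention that a 1 marks a separator and a 0 marks a buffer place is read off from the example above.

```python
from typing import List, Sequence


def partition_to_allocation(bits: Sequence[int]) -> List[int]:
    """Count the zeros between consecutive ones; each count is the size of one niche."""
    allocation = []
    count = 0
    for b in bits:
        if b == 1:
            allocation.append(count)  # a separator closes the current niche
            count = 0
        else:
            count += 1
    allocation.append(count)          # zeros after the last separator
    return allocation


print(partition_to_allocation([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0]))
# -> [2, 4, 2, 1]
```

A vector with $m - 2$ ones and $n$ zeros always yields $m - 1$ counts that sum to $n$, which is why every such partition gives a valid allocation.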

If not stated otherwise we shall use the first algorithm below. For easy 
reference we summarize the main Algorithm 4.2.1 for the noisy BAP as follows: 

Algorithm 6.3.2 (Main Algorithm for the noisy BAP)

1. Choose $\hat P_0$ such that all elements are equal to $1/(n + 1)$. Set $t = 1$.

2. Generate a sample of buffer allocations $X_1, \ldots, X_N$ according to Algorithm 6.3.1, with $P = \hat P_{t-1}$, and compute the $(1 - \rho)$-quantile $\hat\gamma_t$ of the estimated throughputs according to (6.4).

3. Using the same sample, update $\hat P_t$ according to (6.14) and then smooth it out using (4.9).

4. If for some $t \ge d$ convergence condition (6.15) is met, then stop; otherwise set $t = t + 1$ and reiterate from Step 2.

For the deterministic BAP, that is, the problem (6.13), the only change in Algorithm 6.3.2 is that Step 2 is replaced by

2'. ... the actual throughputs according to (6.4).

It is intuitively clear that the noisy BAP converges to the deterministic 
BAP if we decrease the relative error of the estimated throughputs to 0. 
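Step 3 of Algorithm 6.3.2 can be illustrated with a short sketch. It assumes that the estimated throughputs and the level $\hat\gamma_t$ from Step 2 are already available, and that the smoothing rule (4.9) is the usual convex combination $\alpha \hat P_t + (1 - \alpha) \hat P_{t-1}$; the function names `update_matrix` and `smooth` are hypothetical.

```python
from typing import List, Sequence


def update_matrix(
    samples: Sequence[Sequence[int]], scores: Sequence[float], gamma: float, n: int
) -> List[List[float]]:
    """Update (6.14): among samples with score >= gamma, the fraction with x_i = j."""
    elite = [x for x, s in zip(samples, scores) if s >= gamma]
    if not elite:
        raise ValueError("no sample reaches the level gamma")
    niches = len(samples[0])
    P = [[0.0] * (n + 1) for _ in range(niches)]
    for x in elite:
        for i, j in enumerate(x):
            P[i][j] += 1.0 / len(elite)
    return P


def smooth(
    P_new: Sequence[Sequence[float]], P_old: Sequence[Sequence[float]], alpha: float = 0.7
) -> List[List[float]]:
    """ASSUMED form of (4.9): element-wise alpha * new + (1 - alpha) * old."""
    return [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(row_new, row_old)]
        for row_new, row_old in zip(P_new, P_old)
    ]
```

Smoothing keeps every entry of the matrix strictly positive as long as the previous matrix was, so an allocation that happens to be absent from one elite sample can still be generated later.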

Simulation Issues 

Recall that our approach involves two types of randomness: the natural one associated with steady-state simulation of the production line and the artificial one associated with the probability matrix $P$, which we purposely introduced to generate random allocations using Algorithm 6.3.1. The following discussion deals with the first type. Note that although steady-state simulation of
BAP is time consuming and it takes up, in fact, most of the CPU time of 
Algorithm 6.3.2, our main goal here is to show how to incorporate efficiently 
the steady-state output data in a simulation-based optimization framework 
in BAP using CE rather than deal with efficient steady-state simulation tech- 
niques by themselves. Clearly, use of efficient variance reduction techniques, 
such as importance sampling, control variables, etc., can speed up Algorithm 
6.3.2 dramatically. These issues will be addressed elsewhere. It is worth men- 
tioning that the simulation model presents a “black box” for the CE opti- 
mization procedure. Any other method which calculates or approximates the 
expected steady-state objective function for a given buffer allocation can be 
used as well. 

At this juncture we mention several standard rules for simulation-based systems [148] that should be considered when implementing the CE method. Since for any buffer allocation (with a total of $n$ buffer spaces) the expected steady-state performance (like WIP and throughput) is unknown, one has to allow some warmup (terminating simulation) before collecting the statistical data.

One typically starts simulating a vacant system, that is, when all the 
buffers in the production line are empty and all machines are idle. After the 
first job output the buffers are still relatively empty, thus the throughput is 
high. As time progresses the WIP and the throughput converge to their steady 
state. As soon as the process has reached steady state, we start collecting 
the statistical data disregarding those from the warmup period. We run a 
single (long) simulation, called the batch mean [148], in order to estimate the 
unknown expected performance. To identify (approximate) the steady state we 
collect the output transient data in blocks (each of, say, 50 jobs) and perform 
a statistical test for each such block, using the means and variances of all the 
subsequent blocks in the same simulation until the relative error changes very 
little from block to block. We found empirically that for a configuration not 
exceeding 12 machines and 45 buffer spaces a warmup period of 100 jobs is 
quite sufficient. Clearly, with the system at steady state for an output process, 
such as throughput, one can start collecting the output data simultaneously 
for any other performance, such as WIP etc [148]. In our numerical studies 
we selected the size B of the batches as a function of the relative error of the 
underlying output process. 

More specifically, for the percentage relative error $\kappa \times 100\% = 0.3, 0.5, 1.0$ we took $B = 300, 600, 2000$, respectively. A documented Matlab source code file is available at

http://iew3.technion.ac.il/~talraviv/Publications/buffer.zip







6.4 Numerical results for the BAP 



To evaluate the effectiveness of Algorithm 6.3.2 we applied it to various test 
problems. Specifically, we applied Algorithm 6.3.2 to a suite of 70 test cases 
in Vouros and Papadopolous [168]. In all these cases the machine processing 
times have exponential or Erlang2 distributions. Since, in addition, the life and 
repair times are assumed to be exponentially distributed, we can in principle 
calculate the exact optimal buffer allocation and corresponding steady-state 
throughput for these systems, using Markov chain theory, as described in [81]. 
It should be noted, however, that the solutions are in practice only obtainable 
for relatively small n and m. In addition to the 70 test cases, we applied 
Algorithm 6.3.2 to various relatively large systems for which the solutions 
were not available from [168]. In this section we summarize the results on a 
selection of these test problems. 

In all test cases below we set $\rho = 0.1$, $\alpha = 0.7$ (smoothing parameter), took $d = 5$ in our stopping rule (6.15) and generated at each iteration $N = 2mn$ random buffer allocations (we need to estimate the components of the $(m - 1) \times (n + 1)$ matrices $\hat P_t$, so we need at least $mn$ replications). Similar results were obtained with $0.05 < \rho < 0.2$ and $0.5 < \alpha < 0.95$. The algorithm was implemented in Matlab 5.2 and ran on an Intel Pentium III 500MHz processor. For a given buffer allocation we used the batch means method [148] to estimate the steady-state throughput, each simulation run starting with a sufficiently long warmup period.

For each test case we generated 10 independent solutions via Algorithm 6.3.2, say $\hat\gamma_T^{(i)}$, $i = 1, \ldots, 10$. These were compared with either the optimal solution (steady-state output) $\gamma^*$, or with the best known solution $\gamma_T^*$. In the tables below, we use the following notation: The average relative experimental error of the 10 solutions is defined either as

$$\bar\varepsilon = \frac{1}{10} \sum_{i=1}^{10} \frac{|\hat\gamma_T^{(i)} - \gamma^*|}{\gamma^*} \quad \text{or as} \quad \bar\varepsilon = \frac{1}{10} \sum_{i=1}^{10} \frac{|\hat\gamma_T^{(i)} - \gamma_T^*|}{\gamma_T^*} , \qquad (6.16)$$

depending on whether the true optimal solution is known or not. Also, below $\bar\gamma_T$ denotes the average of the 10 generated solutions, and $\varepsilon^*$ and $\varepsilon_*$ denote the worst and the best relative experimental error among the 10 generated solutions. Here, we take again $\gamma_T^*$ instead of $\gamma^*$ when the optimal solution is not known. Finally, BA denotes the optimal buffer allocation vector, and $\bar T$ and CPU denote the average total number of iterations needed before stopping and the average CPU time in seconds, respectively.

Tables 6.1, 6.2 and 6.3 present the results for a number of test cases in Vouros and Papadopolous [168]. In particular, in Tables 6.1 and 6.2 we consider systems with exponential processing times with rates $\mu_i$, $i = 1, \ldots, m$, and in Table 6.3 we consider systems with Erlang2 processing times, with rates $\mu_i$, $i = 1, \ldots, m$; thus, for each machine $i$ the processing time consists of two exponential phases with rates $2\mu_i$. We recall that the machine life and repair times are assumed to be exponential with rates $\beta_i$, $i = 1, \ldots, m$ and $r_i$, $i = 1, \ldots, m$, respectively. We see that the allocations found by the CE method are very close to the exact optimal ones ($\gamma^*$) in [168].

Tables 6.4 and 6.5 present the performance of Algorithm 6.3.2 for m = 6 
and m = 10, respectively, with exponential processing times and different 
values of n. We could not compare the results of Tables 6.4 and 6.5 with any 
alternatives since to the best of our knowledge no case studies are available 
yet for such relatively large systems. We argue, however, that our results are 
accurate and reliable and could serve as case studies to compare different 
algorithms. Note also that $\gamma_T^*$ in Tables 6.4 and 6.5 corresponds to our best solution obtained (on the basis of 10 different runs) for each fixed $n$.

We obtained similar accuracies for different processing time distributions (that is, exponential, normal, Erlang, uniform and deterministic), provided $0.05 < \rho < 0.2$ and $0.5 < \alpha < 0.95$.



Table 6.1. Performance of Algorithm 6.3.2 for BAPs with $m - 1 = 2$ niches and different values of $n$, exponential processing times with rates $\mu_1 = 1$, $\mu_2 = 1.2$, $\mu_3 = 1.4$, failure rates $\beta_i = 0.05$, and repair rates $r_i = 0.5$, $i = 1, \ldots, 3$.

 n    $\bar T$   BA      $\bar\gamma_T$   $\gamma^*$   $\bar\varepsilon$   $\varepsilon^*$   $\varepsilon_*$   CPU
 1    2.0        (1,0)   .6341            .6341        0.0000              0.0000            0                 7
 2    2.0        (1,1)   .6715            .6744        0.0044              0.0088            0                 8
 3    2.6        (2,1)   .6998            .7113        0.0164              0.0590            0                 9
 4    3.5        (3,1)   .7349            .7361        0.0016              0.0054            0                 14
 5    3.8        (3,2)   .7574            .7587        0.0018              0.0059            0                 14
 6    4.3        (4,2)   .7688            .7777        0.0037              0.0211            0                 16
 7    6.2        (5,2)   .7811            .7922        0.0052              0.0171            0                 22
 8    5.1        (5,3)   .8040            .8060        0.0025              0.0084            0                 20
 9    9.1        (6,3)   .8142            .8178        0.0044              0.0163            0                 32
 10   8.3        (7,3)   .8255            .8274        0.0024              0.0095            0                 30



Table 6.2. Performance of Algorithm 6.3.2 for $m - 1 = 4$ niches, different values of $n$, exponential processing times with rates $\mu_1 = 1$, $\mu_2 = 1.1$, $\mu_3 = 1.2$, $\mu_4 = 1.3$, $\mu_5 = 1.5$, failure rates $\beta_i = 0.05$, and repair rates $r_i = 0.5$, $i = 1, \ldots, 5$.

 n    $\bar T$   BA          $\bar\gamma_T$   $\gamma^*$   $\bar\varepsilon$   $\varepsilon^*$   $\varepsilon_*$   CPU
 1    2.6        (0,1,0,0)   .5213            .5213        0.0000              0.0000            0                 12
 2    4.6        (1,1,0,0)   .5479            .5514        0.0060              0.0110            0                 32
 3    3.6        (1,1,1,0)   .5824            .5824        0.0000              0.0000            0                 39
 4    6.4        (1,2,1,0)   .6015            .6027        0.0020              0.0085            0                 67
 5    9.0        (2,2,1,0)   .6202            .6213        0.0018              0.0032            0                 103
 6    5.7        (2,2,1,1)   .6420            .6422        0.0003              0.0031            0                 89
 7    7.7        (2,2,2,1)   .6572            .6585        0.0020              0.0087            0                 116
 8    7.2        (3,2,2,1)   .6731            .6744        0.0020              0.0120            0                 132
 9    9.1        (3,3,2,1)   .6885            .6894        0.0013              0.0067            0                 166
 10   10.7       (3,3,3,1)   .7004            .7005        0.0002              0.0003            0                 197







Table 6.3. Performance of Algorithm 6.3.2 for $m - 1 = 4$ niches, with Erlang2 processing times with rates $\mu_1 = 1$, $\mu_2 = 1.1$, $\mu_3 = 1.2$, $\mu_4 = 1.3$, $\mu_5 = 1.5$, failure rates $\beta_i = 0.05$, and repair rates $r_i = 0.5$, $i = 1, \ldots, 5$.

 n    $\bar T$      BA            $\bar\gamma_T$   $\gamma^*$   $\bar\varepsilon$   $\varepsilon^*$   $\varepsilon_*$   CPU
 1    [illegible]   [illegible]   [illegible]      .5968        [illegible]         0.0000            0                 23
 2    [illegible]   [illegible]   .6331            .6338        0.0011              0.0114            0                 39
 3    3.9           [illegible]   .5824            .5824        [illegible]         0.0000            0                 55
 4    5.8           (2,1,1,0)     .6802            .6808        [illegible]         [illegible]       0                 86
 5    8.3           (2,2,1,0)     .6985            .6996        0.0016              [illegible]       0                 159
 6    6.9           (2,2,1,1)     [illegible]      .7195        0.0114              0.0020            0                 187
 7    12.5          (3,2,2,1)     .7335            .7341        0.0018              0.0007            0                 202
 8    9.8           (3,2,2,1)     [illegible]      .7501        0.0043              0.0007            0                 181
 9    9.7           (3,3,2,1)     [illegible]      .7627        0.0068              0.0009            0                 177
 10   13.6          (4,3,2,1)     .7714            .7740        0.0124              0.0330            0                 261



Table 6.4. Performance of Algorithm 6.3.2 for $m - 1 = 5$ niches and various $n$, exponential processing times with rates $\mu_1 = 8$, $\mu_2 = 11$, $\mu_3 = 14$, $\mu_4 = 14$, $\mu_5 = 11$, $\mu_6 = 8$, failure rates $\beta_i = 0.05$, and repair rates $r_i = 0.5$, $i = 1, \ldots, 6$.

 n    $\bar T$   BA            $\bar\gamma_T$   $\gamma_T^*$   $\bar\varepsilon$   $\varepsilon^*$   $\varepsilon_*$   CPU
 2    4.2        (1,0,0,0,1)   5.4935           5.5027         0.0017              0.0084            0                 33
 4    5.4        (1,0,0,0,1)   5.9245           5.9334         0.0015              0.0076            0                 66
 6    12         (1,1,0,1,1)   6.2443           6.2555         0.0018              0.0050            0                 156
 8    13.4       (2,1,0,1,2)   6.5197           6.5253         0.0009              0.0022            0                 199
 10   25.6       (3,1,1,1,2)   6.7510           6.7589         0.0012              0.0057            0                 386
 12   49         (4,2,1,2,3)   6.9316           6.9360         0.0006              0.0011            0                 766
 14   28.2       (4,2,1,2,5)   7.0684           7.0934         0.0035              0.0079            0                 604
 16   59.2       (5,2,2,2,5)   7.1783           7.1846         0.0009              0.0026            0                 1128
 18   92.6       (6,2,2,3,5)   7.4149           7.4291         0.0019              0.0036            0                 2048



Table 6.5. Performance of Algorithm 6.3.2 for $m - 1 = 9$ niches, exponential processing times with rates $\mu_1 = 8$, $\mu_2 = 8$, $\mu_3 = 11$, $\mu_4 = 14$, $\mu_5 = 14$, $\mu_6 = 11$, $\mu_7 = 8$, $\mu_8 = 8$, $\mu_9 = 6$, $\mu_{10} = 6$, failure rates $\beta_i = 0.05$, and repair rates $r_i = 0.5$, $i = 1, \ldots, 10$.

 n    $\bar T$   BA                    $\bar\gamma_T$   $\gamma_T^*$   $\bar\varepsilon$   $\varepsilon^*$   $\varepsilon_*$   CPU
 2    4.00       (0,0,0,0,0,0,0,1,1)   3.8281           3.8749         [illegible]         [illegible]       0                 110
 4    11.67      (0,0,0,0,0,0,1,1,2)   4.1160           4.1236         [illegible]         [illegible]       0                 402
 6    19.67      (0,1,0,0,0,0,1,2,2)   4.3220           4.3289         [illegible]         [illegible]       0                 964
 8    23.33      (0,1,0,0,0,1,1,2,3)   4.5325           [illegible]    [illegible]         [illegible]       0                 1200
 10   12.67      (1,1,0,0,0,1,2,2,3)   4.6146           4.6426         0.0060              [illegible]       0                 1165
 12   14.33      (1,1,0,0,0,1,2,3,4)   4.7814           4.7946         0.0028              [illegible]       0                 1719
 14   37.00      (1,1,0,0,1,1,2,3,5)   4.8852           4.8895         0.0008              [illegible]       0                 3325
 16   49.33      (1,1,0,1,0,2,2,4,5)   4.9832           4.9891         0.0018              [illegible]       0                 5117
 18   186.67     (2,1,1,0,0,1,3,4,6)   5.0414           [illegible]    0.0045              0.0117            0                 20714



























Dynamics 



We illustrate the dynamics of the matrices $\hat P_t$ for a benchmark problem with 4 niches, 10 buffer spaces, normally distributed processing times with $\mu = 6$, $\sigma = 2$, and $N = 80$.

0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909
0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909
0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909
0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909 0.0909

0.0002 0.0013 0.0139 0.4484 0.5349 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002
0.0002 0.0002 0.0303 0.8226 0.1432 0.0014 0.0013 0.0002 0.0002 0.0002 0.0002
0.0002 0.0483 0.9410 0.0089 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002
0.0014 0.6007 0.3913 0.0051 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002

0.0000 0.0000 0.0038 0.0179 0.9783 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0001 0.9996 0.0003 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0001 0.9445 0.0554 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.9801 0.0199 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000

It follows from the results above that starting from $\hat P_0$ with the elements $1/(n+1) = 1/11 = 0.0909$ Algorithm 6.3.2 stopped after 9 iterations, allocating 4, 3, 2, and 1 buffer spaces to niches 1, 2, 3, and 4, respectively.
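The allocation can be read off the last matrix by taking, in each row, the column with the largest probability. The short sketch below does this for the final matrix printed above; the variable names are chosen only for illustration.

```python
# Rows of the final matrix shown above; column j is the probability of j buffer places.
final_matrix = [
    [0.0000, 0.0000, 0.0038, 0.0179, 0.9783, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0000, 0.0000, 0.0001, 0.9996, 0.0003, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0000, 0.0001, 0.9445, 0.0554, 0.0000, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0000, 0.9801, 0.0199, 0.0000, 0.0000, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]

# For each niche, pick the number of buffer places with the highest probability.
allocation = [row.index(max(row)) for row in final_matrix]
print(allocation)       # [4, 3, 2, 1]
print(sum(allocation))  # 10, the total number of buffer spaces
```

The row maxima sum to the available 10 buffer spaces, so the near-degenerate matrix identifies a feasible allocation.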

Our numerical studies suggest that the proposed algorithm is fast and 
typically performs well, in the sense that in approximately 99% of the cases 
the relative experimental error e does not exceed 0.01. 

Further topics for investigation include (a) establishing convergence of Algorithm 6.3.2 for finite sampling (that is, $N < \infty$) with emphasis on the complexity and the speed of convergence under the suggested stopping rules; (b) establishing confidence intervals (regions) for the optimal solution; (c) application of parallel optimization techniques to the proposed methodology; and (d) investigations regarding a further speed-up of the algorithm. With respect to (d), we note that initially the throughputs do not need to be estimated very accurately, since the procedure just needs a rough idea which buffer allocations are good or not. However, further on in the procedure the accuracy
needs to be increased to distinguish between competing “good” solutions. In 
the present test cases the same accuracy was used for all iterations, since the 
goal of this section was to show that for the “noisy” BAP high accuracy could 
be achieved within a reasonable time. 



6.5 Numerical Results for the Noisy TSP 

In this section we present the performance of Algorithm 4.7.3 for both deterministic and stochastic (noisy) versions of the travelling salesman problem (TSP), using the formalism of Section 6.2. If not stated otherwise we use the indicator reward function $I_{\{S(x) \le \gamma\}}$ (or $I_{\{\hat S(x) \le \gamma\}}$ in the noisy case) in our traditional two-stage procedure for updating the sequence $\{(\hat\gamma_t, \hat P_t)\}$ (for alternative reward functions see Section 5.2).

To generate different case studies we choose the elements $c_{ij}$ of the cost matrix $C = (c_{ij})$ as i.i.d. random variables distributed according to some specified distribution. Once the matrix $C$ is fixed we generate at each stage of the CE algorithm the $Y_{ij}$ according to another specified distribution, with expectation $c_{ij}$. In all examples, we set $\rho = 0.01$ and, depending on whether the deterministic or stochastic (noisy) version is considered, we stopped Algorithm 4.7.3 either according to the stopping rule (4.10) or according to the stopping rule (6.11). We also used stopping rule (4.43), which is suitable for both the deterministic and the stochastic (noisy) problems. In all cases (see (4.10), (6.11) and (4.43)) we took $d = k = 5$. We also present the CPU time of Algorithm 4.7.3 implemented in Matlab on a 1.4GHz processor with 256M RAM along with the relative experimental error, defined as



$$\varepsilon = \frac{|\gamma^* - \hat\gamma_T|}{\gamma^*} , \qquad (6.17)$$

provided $\gamma^* \ne 0$, where $\hat\gamma_T$ is the estimate of $\gamma^*$ obtained after stopping. As estimates of the unknown parameters $\gamma_{1t}$ and $\gamma_{2t}$ (see Section 6.2) we take $\hat\gamma_{1t}$ and $\hat\gamma_{2t}$ respectively, where indices 1 and 2 correspond to the deterministic and stochastic versions, respectively. To indicate the convergence of $\hat P_t$ to the ODTM $P^*$ we also present the following quantities:

$$P_t^{mm} = \min_{1 \le i \le n} \max_{1 \le j \le n} \hat p_{t,ij} , \qquad (6.18)$$

and

$$P_{t,mm} = \max_{1 \le i \le n} \min_{1 \le j \le n} \hat p_{t,ij} . \qquad (6.19)$$

They correspond to the min max and max min values of the elements of the matrix $\hat P_t = (\hat p_{t,ij})$ at iteration $t$, respectively. It is readily seen that $P_t^{mm} = 1$ if and only if $\hat P_t = P^*$. We use the notations $P_{1t}^{mm}$ and $P_{2t}^{mm}$ for the deterministic and the stochastic versions, respectively.
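The two diagnostics are cheap to compute from a transition matrix stored as a list of rows. The function names `min_max` and `max_min` are hypothetical, and the small matrices in the usage lines are made up for illustration only.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def min_max(P: Matrix) -> float:
    """Smallest row maximum: equals 1 only when every row puts all its mass on one entry."""
    return min(max(row) for row in P)


def max_min(P: Matrix) -> float:
    """Largest row minimum."""
    return max(min(row) for row in P)


degenerate = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]  # tour 1 -> 2 -> 3 -> 1
uniform = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
print(min_max(degenerate), min_max(uniform))  # 1.0 0.5
```

Tracking the min max value over the iterations gives a single number between $1/(n-1)$, for a uniform matrix with zero diagonal, and 1, for a degenerate matrix.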

Consider the synthetic TSP of Section 4.10.1 with optimal tour $1 \to 2 \to 3 \to \cdots \to n \to 1$ with total cost $\gamma^* = n$.

We consider separately (a) deterministic and (b) stochastic TSP.

(a) Deterministic TSP

Table 6.6 presents the relative experimental errors $\varepsilon_k$, $k = 1, \ldots, 4$ (and the associated stopping times $T_k$, $k = 1, \ldots, 4$) as a function of the sample size $N = r n^2 = r\,2500$ for $n = 50$ and the following four cases of $c_{ij}$: $c_{1,ij} \sim U(2, 10)$; $c_{2,ij} = 2$; $c_{3,ij} \sim U(9, 10)$, and $c_{4,ij} = 9$.







Table 6.6. The relative experimental errors $\varepsilon_i$, $i = 1, \ldots, 4$ (and the associated stopping times $T_i$, $i = 1, \ldots, 4$) as a function of the sample size $N = r n^2 = r\,2500$ for the four cases of $c_{ij}$ above.

 N                  $\varepsilon_1 (T_1)$   $\varepsilon_2 (T_2)$   $\varepsilon_3 (T_3)$   $\varepsilon_4 (T_4)$
 $n^2$ = 2500       0.104 (29.3)            0.0 (16.3)              0.0 (15.8)              0.0 (16)
 $3n^2$ = 7500      0.0 (18)                0.0 (13.3)              0.0 (13.5)              0.0 (13.8)
 $5n^2$ = 12500     0.0 (17.5)              0.0 (13.8)              0.0 (13.3)              0.0 (13)
 $10n^2$ = 25000    0.0 (17)                0.0 (13)                0.0 (13)                0.0 (13)



Table 6.7 presents the associated CPU times, denoted CPU$_1$, ..., CPU$_4$, as a function of the sample size $N$.

Table 6.7. The CPU times as a function of the sample size $N$.

 N                  CPU$_1$   CPU$_2$   CPU$_3$   CPU$_4$
 $n^2$ = 2500       48.5      27.0      26.5      26.8
 $3n^2$ = 7500      92.0      64.3      66.5      67.8
 $5n^2$ = 12500     143.0     111.3     108.3     105.5
 $10n^2$ = 25000    279.5     205.5     205.0     204.8



It is seen from Table 6.6 that for fixed $N$, $T_1 > T_3$ and $T_2 > T_4$. The reason is that the lengths of all tours in cases 1 and 2 are “closer” to each other than in the alternative cases 3 and 4, respectively. As a result, convergence of $(\hat\gamma_t, \hat P_t)$ to $(\gamma^*, P^*)$ is slower in cases 1 and 2 as compared to cases 3 and 4, respectively.

(b) Stochastic TSP 

The test cases for the noisy TSP were constructed by first generating the deterministic matrix $(c_{ij})$ exactly as in the deterministic case above, and then generating the $Y_{ij}$ from a prescribed distribution with $EY_{ij} = c_{ij}$. Below we present numerical results for the same TSP of Section 4.10.1, with $\gamma^* = n = 40$ and for the following six cases of $Y_{ij}$:

1. $Y_{ij} = c_{ij}$ (deterministic case).

2. $Y_{ij} \sim \mathrm{Beta}(10, 10, c_{ij} - 1, c_{ij} + 1)$.

3. $Y_{ij} \sim U(c_{ij} - 0.5, c_{ij} + 0.5)$.

4. $Y_{ij} \sim U(c_{ij} - 1, c_{ij} + 1)$.

5. $Y_{ij} \sim \mathrm{Po}(c_{ij})$.

6. $Y_{ij} \sim \mathrm{Exp}(c_{ij}^{-1})$.







In each of the six experiments we averaged the output statistics for the relative experimental errors $\varepsilon_i$, $i = 1, \ldots, 6$ and the associated stopping times $T_i$, $i = 1, \ldots, 6$ on the basis of five independent replications, each of size $N = r n^2 = r\,1600$, $r \ge 1$. We found that for $r \ge 10$ the (average) relative experimental error equals zero, that is, Algorithm 4.7.3 is exact.

Table 6.8 presents the average relative experimental errors $\varepsilon_k$, $k = 1, \ldots, 6$ (and the associated stopping times $T_k$, $k = 1, \ldots, 6$) as a function of the sample size $N = r n^2 = r\,1600$ for $n = 40$ and the six cases of $Y_{ij}$ above. Here, for the deterministic network ($Y_{ij} = c_{ij}$) we used the stopping rule (4.10), while for the remaining five stochastic networks we used the stopping rule (6.11).



Table 6.8. The average relative experimental errors $\varepsilon_i$, $i = 1, \ldots, 6$ (and the associated stopping times $T_i$, $i = 1, \ldots, 6$) as a function of the sample size $N = r n^2 = r\,1600$ for the six cases of $Y_{ij}$ above.

 N       $\varepsilon_1 (T_1)$   $\varepsilon_2 (T_2)$   $\varepsilon_3 (T_3)$   $\varepsilon_4 (T_4)$   $\varepsilon_5 (T_5)$   $\varepsilon_6 (T_6)$
 8000    0.006 (24.3)            0.029 (38)              0.037 (36.5)            0.040 (42.5)            0.053 (62.8)            0.058 (89.5)
 16000   0.000 (21.8)            0.007 (35)              0.007 (35)              0.019 (42)              0.025 (61.5)            0.034 (90)
 32000   0.000 (20.5)            0.000 (34)              0.000 (35.3)            0.000 (40)              0.013 (57)              0.013 (88.8)



Table 6.9 presents the associated average CPU times, denoted CPU$_1$, \ldots,
CPU$_6$, as a function of the sample size $N$.



Table 6.9. The average CPU times as a function of the sample size $N$.

N        CPU$_1$    CPU$_2$    CPU$_3$    CPU$_4$    CPU$_5$    CPU$_6$
8000      81.3    266.8    157.3    176.0    286.0    326.8
16000    138.5    478.0    294.8    347.8    575.8    839.0
32000    262.5    960.5    581.3    667.5   1081.3   1341.3



It is seen in Table 6.8 that for fixed $N$ the accuracy of Algorithm 4.7.3
decreases as the variance (noise) of $Y_{ij}$ increases. This is in agreement with
the fact that the $\varepsilon_k$, $k = 1,\ldots,6$ in Table 6.8 are purposely arranged in
nondecreasing order, that is, $\varepsilon_1 \leq \varepsilon_2 \leq \cdots \leq \varepsilon_6$, and this arrangement matches
the corresponding variances of $Y_{k,ij}$, $k = 1, \ldots, 6$, in the sense that

$$0 = \mathrm{Var}(Y_{1,ij}) \leq \mathrm{Var}(Y_{2,ij}) \leq \cdots \leq \mathrm{Var}(Y_{6,ij}) .$$
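The variance ordering can be checked directly from the six distributions. The computation below is a sketch under two assumptions that the text does not state explicitly: that the four-parameter Beta denotes a Beta($\alpha$, $\beta$) variable with $\alpha = \beta = 10$ rescaled to the interval $[a, b] = [c_{ij} - 1, c_{ij} + 1]$, and that the exponential distribution is parameterized by its rate $c_{ij}^{-1}$.

```latex
\begin{align*}
\mathrm{Var}(Y_{1,ij}) &= 0, \\
\mathrm{Var}(Y_{2,ij}) &= (b-a)^2\,\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}
                        = 4\cdot\frac{100}{400\cdot 21} = \frac{1}{21} \approx 0.048, \\
\mathrm{Var}(Y_{3,ij}) &= \frac{1^2}{12} \approx 0.083, \\
\mathrm{Var}(Y_{4,ij}) &= \frac{2^2}{12} = \frac{1}{3} \approx 0.333, \\
\mathrm{Var}(Y_{5,ij}) &= c_{ij}, \\
\mathrm{Var}(Y_{6,ij}) &= c_{ij}^{2}.
\end{align*}
```

The first four values are ordered for any $c_{ij}$; the full chain additionally needs $1/3 \leq c_{ij} \leq c_{ij}^2$, which holds whenever $c_{ij} \geq 1$.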







We also found that for small samples ($r \leq 2$), Algorithm 4.7.3 does not
converge for most cases.

Tables 6.10 and 6.11 present data similar to Tables 6.8 and 6.9, respectively,
using the same stopping rule (4.43) for both the deterministic ($Y_{ij} = c_{ij}$) and
stochastic networks. We see that the results of Tables 6.10 and 6.11 are
nearly the same as those of Tables 6.8 and 6.9.



Table 6.10. Data similar to Table 6.8 using stopping rule (4.43).

N        $\varepsilon_1$ ($T_1$)    $\varepsilon_2$ ($T_2$)    $\varepsilon_3$ ($T_3$)    $\varepsilon_4$ ($T_4$)    $\varepsilon_5$ ($T_5$)    $\varepsilon_6$ ($T_6$)
8000     0.008 (23.7)   0.011 (23)     0.016 (24.5)   0.037 (29)     0.038 (45.8)   0.045 (71.8)
16000    0.000 (23.5)   0.007 (22.5)   0.015 (23.5)   0.023 (29.8)   0.030 (43.3)   0.031 (67)
32000    0.000 (21.3)   0.000 (21.5)   0.000 (23.3)   0.007 (25.8)   0.009 (43.8)   0.013 (67.5)



Table 6.11. Data similar to Table 6.9 using stopping rule (4.43).

N        CPU$_1$    CPU$_2$    CPU$_3$    CPU$_4$    CPU$_5$    CPU$_6$
8000      79.5    160.0    101.8    122.0    214.5    263.5
16000    150.0    314.5    196.0    247.8    409.5    488.8
32000    272.5    605.5    431.8    431.8    830.5    990.3



Table 6.12 presents the dynamics of Algorithm 4.7.3 for cases 1 and 4
associated with Table 6.10, using the same stopping rule (4.43) with $n = 40$.
More precisely, Table 6.12 presents the comparative evolution of the shortest
tour estimates $\widehat{\gamma}_{1t}$ for the deterministic case ($Y_{ij} = c_{ij}$) (using $N_1 = 8000$),
and $\widehat{\gamma}_{2t}$ for the noisy case with $Y_{ij} \sim \mathsf{U}(c_{ij} - 1, c_{ij} + 1)$ (using $N_2 = 16000$).
It also displays the evolution of $P_{1t}^{mm}$ and $P_{2t}^{mm}$. The deterministic and noisy
versions of the algorithm stopped after 22 and 44 iterations and both found
the shortest tour (with $\gamma^* = 40$) exactly. The CPU times were 125 and 383
seconds for the deterministic and noisy case, respectively. Figure 6.3 presents
$\widehat{\gamma}_{1t}$ and $\widehat{\gamma}_{2t}$ as functions of the iteration $t$ based on the data of Table 6.12.







Table 6.12. Comparative evolution of Algorithm 4.7.3 for the deterministic and the
noisy ($\mathsf{U}(-1, +1)$) case with $n = 40$ and with sample sizes $N_1 = 8000$
and $N_2 = 16000$, respectively.

$t$     $\widehat{\gamma}_{1t}$       $\widehat{\gamma}_{2t}$       $P_{1t}^{mm}$     $P_{2t}^{mm}$
1     179.0520    179.4279    0.058578    0.048116
2     144.0247    145.0523    0.076042    0.065749
3     116.1935    118.6788    0.096642    0.090130
4     97.05315    99.11970    0.129601    0.098680
5     83.49589    84.54807    0.125402    0.121932
13    49.87714    53.80331    0.256380    0.192082
14    48.47305    52.34546    0.315090    0.204799
15    45.63236    51.81999    0.381699    0.208865
16    43.48742    50.07529    0.414624    0.192094
17    40.00000    48.48727    0.925295    0.212014
18    40.00000    46.81393    0.992530    0.193051
23    40.00000    40.55133    0.999925    0.324640
24    40.00000    39.42279    0.999925    0.334615
25    40.00000    38.12740    0.999925    0.397181
26    40.00000    35.99771    0.999925    0.540564
27    40.00000    34.15617    0.999925    0.821609
28    40.00000    32.38640    0.999925    0.967815
41    40.00000    31.46442    0.999925    1.000000
42    40.00000    31.46441    0.999925    1.000000
43    40.00000    31.57093    0.999925    1.000000
44    40.00000    31.38555    0.999925    1.000000



Fig. 6.3. $\widehat{\gamma}_{1t}$ and $\widehat{\gamma}_{2t}$ as functions of $t$ based on the data of Table 6.12.
(Vertical axis: $\widehat{\gamma}_t$, from 0 to 180; horizontal axis: iteration $t$, from 0 to 45.)








Table 6.13 presents data similar to Table 6.12, but now for the noisy
case $Y_{ij} \sim \mathsf{Exp}(c_{ij}^{-1})$. The sample sizes were $N_1 = 8000$ and $N_2 = 32000$. The
CPU times were 125 and 614 seconds for the deterministic and noisy cases,
respectively. For the deterministic case, the shortest tour was found exactly.
Note that Table 6.13 presents a situation in which the variance of $Y_{ij}$ is quite
large. For this reason, Algorithm 4.7.3 stopped for the noisy case after 74
iterations compared with 25 iterations for the deterministic case.



Table 6.13. Comparative evolution of Algorithm 4.7.3 for the deterministic and the
exponential noisy case with $n = 40$ and with sample sizes $N_1 = 8000$
and $N_2 = 32000$, respectively.

$t$     $\widehat{\gamma}_{1t}$       $\widehat{\gamma}_{2t}$       $P_{1t}^{mm}$     $P_{2t}^{mm}$
1     178.9196    133.5925    0.048438    0.039064
2     145.1071    121.1037    0.073678    0.047518
3     119.1800    112.4635    0.100546    0.055369
4     98.29205    103.1123    0.130158    0.060051
5     84.25539    95.44152    0.158822    0.069481
17    46.01890    57.40773    0.321939    0.131134
18    44.11950    56.34756    0.364391    0.123450
19    42.54073    55.62273    0.351199    0.136384
20    42.49261    55.18718    0.331856    0.137117
21    41.39288    54.24865    0.542151    0.146697
22    41.00371    54.52926    0.475954    0.147014
23    40.00000    53.50527    0.926640    0.150714
24    40.00000    52.93636    0.992664    0.147023
25    40.00000    51.94572    0.999266    0.149987
26    40.00000    50.90347    0.999266    0.157964
54    40.00000    35.40847    0.999266    0.500419
55    40.00000    33.79186    0.999266    0.648272
56    40.00000    31.33194    0.999266    0.807651
57    40.00000    28.92427    0.999266    0.944733
58    40.00000    27.44452    0.999266    0.977367
59    40.00000    26.91028    0.999266    0.995102
60    40.00000    26.94722    0.999266    0.999510
61    40.00000    26.85958    0.999266    0.999951
62    40.00000    26.72758    0.999266    0.999995
63    40.00000    26.78825    0.999266    1.000000
72    40.00000    26.76143    0.999266    1.000000
73    40.00000    26.81044    0.999266    1.000000
74    40.00000    26.66721    0.999266    1.000000







We also ran Algorithm 4.7.3 for the case studies in Table 2.5 in the presence
of noise; that is, we used noisy cost matrices instead of the original
deterministic ones. The noise was generated in a fashion similar to that used
for Tables 6.8 and 6.10. Tables 6.14 and 6.15 present such case studies for the
problems ftv47 and ftv55 in Table 2.5.



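Table 6.14. Case studies for the noisy TSP ftv47 with $n = 48$ and $N = 10\,n^2$.

$Y_{ij}$                          $T$    $\widehat{\gamma}_1$    $\widehat{\gamma}_T$    $\gamma^*$    $\bar{\varepsilon}$      $\varepsilon^*$     $\varepsilon_*$     CPU
$Y_{ij} = c_{ij}$                 61   5622   1787   1776   0.018   0.043   0.006    284.0
$Y_{ij} \sim c_{ij} + \mathsf{U}(-7, 7)$    61   5723   1814   1776   0.035   0.044   0.021    459.8
$Y_{ij} \sim \mathsf{Po}(c_{ij})$           60   5732   1817   1776   0.045   0.068   0.023   2287.2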



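Table 6.15. Case studies for the noisy TSP ftv55 with $n = 56$ and $N = 10\,n^2$.

$Y_{ij}$                          $T$    $\widehat{\gamma}_1$    $\widehat{\gamma}_T$    $\gamma^*$    $\bar{\varepsilon}$      $\varepsilon^*$     $\varepsilon_*$     CPU
$Y_{ij} = c_{ij}$                 67   6256   1614   1608   0.006   0.014   0.004    543.3
$Y_{ij} \sim c_{ij} + \mathsf{U}(-5, 5)$    75   6402   1615   1608   0.029   0.040   0.004   9809.9
$Y_{ij} \sim \mathsf{Po}(c_{ij})$           78   6255   1646   1608   0.064   0.087   0.024   4412.2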



Similar to the results in Tables 6.8 and 6.10 we found that the relative
accuracy remains approximately the same as for the deterministic case, provided
the sample size is increased from $5n^2$ to $15n^2$.

Finally, we compare the dynamics of the deterministic TSP at the end of
Section 2.5.2 with its noisy counterpart. Thus, below we present $\widehat{P}_t$ as a function
of the iteration number $t$ for the synthetic noisy TSP with $n = 10$ cities. We
take $Y_{ij} \sim \mathsf{U}(c_{ij} - 1, c_{ij} + 1)$ and a sample size of $N = 2000$. Note that for the
deterministic case a sample size of $N = 500$ was taken.

Dynamics of $\widehat{P}_t$

$$\widehat{P}_1 = \begin{pmatrix}
0.00 & 0.12 & 0.12 & 0.08 & 0.17 & 0.12 & 0.08 & 0.08 & 0.17 & 0.04 \\
0.17 & 0.00 & 0.29 & 0.08 & 0.08 & 0.08 & 0.04 & 0.08 & 0.08 & 0.08 \\
0.07 & 0.07 & 0.00 & 0.32 & 0.07 & 0.04 & 0.07 & 0.14 & 0.07 & 0.14 \\
0.04 & 0.21 & 0.08 & 0.00 & 0.12 & 0.12 & 0.12 & 0.08 & 0.12 & 0.08 \\
0.09 & 0.05 & 0.09 & 0.09 & 0.00 & 0.27 & 0.09 & 0.09 & 0.18 & 0.05 \\
0.08 & 0.08 & 0.08 & 0.04 & 0.04 & 0.00 & 0.42 & 0.08 & 0.12 & 0.04 \\
0.15 & 0.23 & 0.04 & 0.08 & 0.08 & 0.04 & 0.00 & 0.15 & 0.08 & 0.15 \\
0.05 & 0.10 & 0.05 & 0.15 & 0.25 & 0.05 & 0.05 & 0.00 & 0.20 & 0.10 \\
0.09 & 0.05 & 0.14 & 0.14 & 0.14 & 0.09 & 0.09 & 0.05 & 0.00 & 0.23 \\
0.25 & 0.08 & 0.04 & 0.08 & 0.08 & 0.04 & 0.04 & 0.29 & 0.08 & 0.00
\end{pmatrix}$$







$$\widehat{P}_2 = \begin{pmatrix}
0.00 & 0.32 & 0.08 & 0.06 & 0.12 & 0.16 & 0.06 & 0.06 & 0.12 & 0.03 \\
0.11 & 0.00 & 0.60 & 0.06 & 0.06 & 0.04 & 0.03 & 0.04 & 0.04 & 0.04 \\
0.05 & 0.05 & 0.00 & 0.52 & 0.05 & 0.02 & 0.05 & 0.07 & 0.05 & 0.15 \\
0.03 & 0.13 & 0.04 & 0.00 & 0.13 & 0.09 & 0.09 & 0.22 & 0.18 & 0.07 \\
0.04 & 0.03 & 0.06 & 0.06 & 0.00 & 0.47 & 0.06 & 0.08 & 0.16 & 0.04 \\
0.06 & 0.04 & 0.06 & 0.03 & 0.03 & 0.00 & 0.67 & 0.06 & 0.04 & 0.03 \\
0.21 & 0.29 & 0.03 & 0.06 & 0.04 & 0.03 & 0.00 & 0.21 & 0.06 & 0.08 \\
0.04 & 0.04 & 0.04 & 0.09 & 0.31 & 0.04 & 0.04 & 0.00 & 0.31 & 0.09 \\
0.04 & 0.03 & 0.04 & 0.08 & 0.24 & 0.07 & 0.07 & 0.03 & 0.00 & 0.40 \\
0.46 & 0.06 & 0.03 & 0.08 & 0.06 & 0.04 & 0.03 & 0.19 & 0.06 & 0.00
\end{pmatrix}$$

$$\widehat{P}_3 = \begin{pmatrix}
0.00 & 0.62 & 0.05 & 0.04 & 0.08 & 0.07 & 0.04 & 0.04 & 0.04 & 0.02 \\
0.08 & 0.00 & 0.72 & 0.04 & 0.04 & 0.03 & 0.02 & 0.03 & 0.03 & 0.03 \\
0.03 & 0.03 & 0.00 & 0.65 & 0.03 & 0.02 & 0.03 & 0.05 & 0.03 & 0.12 \\
0.08 & 0.09 & 0.03 & 0.00 & 0.23 & 0.15 & 0.06 & 0.08 & 0.23 & 0.05 \\
0.04 & 0.02 & 0.05 & 0.05 & 0.00 & 0.54 & 0.05 & 0.04 & 0.17 & 0.04 \\
0.04 & 0.03 & 0.04 & 0.02 & 0.02 & 0.00 & 0.77 & 0.04 & 0.03 & 0.02 \\
0.17 & 0.13 & 0.02 & 0.04 & 0.03 & 0.02 & 0.00 & 0.42 & 0.04 & 0.13 \\
0.03 & 0.03 & 0.04 & 0.06 & 0.37 & 0.03 & 0.03 & 0.00 & 0.37 & 0.04 \\
0.03 & 0.03 & 0.03 & 0.09 & 0.17 & 0.04 & 0.05 & 0.04 & 0.00 & 0.52 \\
0.52 & 0.04 & 0.02 & 0.04 & 0.04 & 0.03 & 0.02 & 0.24 & 0.04 & 0.00
\end{pmatrix}$$



At later iterations:

$$\begin{pmatrix}
0.00 & 0.92 & 0.01 & 0.01 & 0.02 & 0.01 & 0.01 & 0.01 & 0.01 & 0.00 \\
0.02 & 0.00 & 0.93 & 0.01 & 0.01 & 0.01 & 0.00 & 0.01 & 0.01 & 0.01 \\
0.01 & 0.01 & 0.00 & 0.89 & 0.01 & 0.00 & 0.01 & 0.06 & 0.01 & 0.01 \\
0.01 & 0.01 & 0.00 & 0.00 & 0.88 & 0.04 & 0.01 & 0.02 & 0.02 & 0.01 \\
0.01 & 0.00 & 0.01 & 0.01 & 0.00 & 0.90 & 0.01 & 0.01 & 0.04 & 0.01 \\
0.01 & 0.01 & 0.01 & 0.01 & 0.01 & 0.00 & 0.94 & 0.01 & 0.01 & 0.01 \\
0.02 & 0.01 & 0.00 & 0.01 & 0.01 & 0.00 & 0.00 & 0.88 & 0.01 & 0.05 \\
0.01 & 0.01 & 0.01 & 0.01 & 0.04 & 0.00 & 0.00 & 0.00 & 0.92 & 0.01 \\
0.01 & 0.00 & 0.01 & 0.06 & 0.01 & 0.01 & 0.01 & 0.01 & 0.00 & 0.89 \\
0.92 & 0.01 & 0.00 & 0.01 & 0.01 & 0.01 & 0.00 & 0.02 & 0.01 & 0.00
\end{pmatrix}$$

$$\begin{pmatrix}
0.00 & 0.93 & 0.01 & 0.01 & 0.01 & 0.01 & 0.01 & 0.01 & 0.01 & 0.00 \\
0.02 & 0.00 & 0.90 & 0.01 & 0.05 & 0.01 & 0.00 & 0.01 & 0.01 & 0.01 \\
0.01 & 0.01 & 0.00 & 0.91 & 0.01 & 0.00 & 0.01 & 0.05 & 0.01 & 0.01 \\
0.01 & 0.01 & 0.00 & 0.00 & 0.88 & 0.05 & 0.01 & 0.01 & 0.02 & 0.01 \\
0.01 & 0.00 & 0.05 & 0.01 & 0.00 & 0.88 & 0.01 & 0.01 & 0.03 & 0.01 \\
0.01 & 0.01 & 0.01 & 0.00 & 0.00 & 0.00 & 0.95 & 0.01 & 0.01 & 0.00 \\
0.02 & 0.01 & 0.00 & 0.01 & 0.00 & 0.00 & 0.00 & 0.90 & 0.01 & 0.04 \\
0.00 & 0.00 & 0.01 & 0.01 & 0.03 & 0.00 & 0.00 & 0.00 & 0.93 & 0.01 \\
0.00 & 0.00 & 0.00 & 0.05 & 0.01 & 0.01 & 0.01 & 0.01 & 0.00 & 0.91 \\
0.93 & 0.01 & 0.00 & 0.01 & 0.01 & 0.01 & 0.00 & 0.02 & 0.01 & 0.00
\end{pmatrix}$$

$$\begin{pmatrix}
0.00 & 0.94 & 0.01 & 0.01 & 0.01 & 0.01 & 0.01 & 0.01 & 0.01 & 0.00 \\
0.01 & 0.00 & 0.92 & 0.01 & 0.04 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\
0.00 & 0.00 & 0.00 & 0.93 & 0.00 & 0.00 & 0.00 & 0.04 & 0.00 & 0.01 \\
0.01 & 0.01 & 0.00 & 0.00 & 0.91 & 0.04 & 0.01 & 0.01 & 0.02 & 0.00 \\
0.01 & 0.00 & 0.04 & 0.01 & 0.00 & 0.90 & 0.01 & 0.01 & 0.03 & 0.01 \\
0.01 & 0.01 & 0.01 & 0.00 & 0.00 & 0.00 & 0.95 & 0.01 & 0.01 & 0.00 \\
0.01 & 0.01 & 0.00 & 0.01 & 0.00 & 0.00 & 0.00 & 0.92 & 0.01 & 0.04 \\
0.00 & 0.00 & 0.00 & 0.01 & 0.03 & 0.00 & 0.00 & 0.00 & 0.94 & 0.00 \\
0.00 & 0.00 & 0.00 & 0.04 & 0.01 & 0.01 & 0.01 & 0.01 & 0.00 & 0.92 \\
0.94 & 0.01 & 0.00 & 0.01 & 0.01 & 0.01 & 0.00 & 0.02 & 0.01 & 0.00
\end{pmatrix}$$

$$\begin{pmatrix}
0.00 & 0.95 & 0.01 & 0.00 & 0.01 & 0.01 & 0.00 & 0.00 & 0.01 & 0.00 \\
0.01 & 0.00 & 0.93 & 0.01 & 0.03 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\
0.00 & 0.00 & 0.00 & 0.94 & 0.00 & 0.00 & 0.00 & 0.03 & 0.00 & 0.01 \\
0.01 & 0.01 & 0.00 & 0.00 & 0.92 & 0.03 & 0.01 & 0.01 & 0.01 & 0.00 \\
0.01 & 0.00 & 0.03 & 0.01 & 0.00 & 0.92 & 0.01 & 0.00 & 0.02 & 0.00 \\
0.01 & 0.00 & 0.01 & 0.00 & 0.00 & 0.00 & 0.96 & 0.01 & 0.00 & 0.00 \\
0.01 & 0.01 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.93 & 0.00 & 0.03 \\
0.00 & 0.00 & 0.00 & 0.01 & 0.02 & 0.00 & 0.00 & 0.00 & 0.95 & 0.00 \\
0.00 & 0.00 & 0.00 & 0.03 & 0.01 & 0.00 & 0.01 & 0.00 & 0.00 & 0.94 \\
0.95 & 0.01 & 0.00 & 0.01 & 0.01 & 0.01 & 0.00 & 0.01 & 0.01 & 0.00
\end{pmatrix}$$







6.6 Exercises 

1. Consider the continuous multi-extremal optimization problem in Exercise 5.1.
Add noise to the performance function $S$, and apply the CE
method, for example as implemented in the program cecont.m of
Appendix A.3. In particular, add U(−0.1, 0.1), N(0, 0.01) and N(0, 1) noise.
Observe how $\widehat{\gamma}_t$, $\widehat{\mu}_t$ and $\widehat{\sigma}_t$ behave. Formulate an appropriate stopping
criterion, for example based on $\widehat{\sigma}_t$.

2. Consider the Rosenbrock experiment of Table 5.1 and the corresponding
Matlab code in Appendix A.5. Add N(0, 0.01) noise to the objective
function RosPen by appending the line S = S + 0.1*randn(N,1); to the end
of the code. Observe the convergence behavior of the CE algorithm for
10000 iterations, using the original parameters. Compare this with the 
behavior when N = 200, = 5 and /? = 3.5. 

3. Add random noise to the data from the URL
http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/atsp/
and obtain a table similar to Table 6.15.




7 



Applications of CE to COPs 



In this chapter we show how CE can be employed to solve combinatorial
optimization problems (COPs) from various fields. Our goal is to illustrate the
flexibility and efficiency of the CE algorithm. We believe that the CE
approach could open up a whole range of possible applications, for example in
computational biology, graph theory, and scheduling. Section 7.1 deals with
an application of the CE method concerning the alignment of DNA sequences;
this section is based partly on [98]. In Sections 7.2-7.4, adapted from [113], the
CE method is used to solve the single machine total weighted tardiness
problem (SMTWTP), the single machine common due date problem (SMDDP),
and the capacitated vehicle routing problem (CVRP). Finally, in Section 7.5
we consider various CE approaches for solving the maximum clique and
related graph problems.



7.1 Sequence Alignment 

Sequence alignment is a frequently encountered theme in computational biol- 
ogy. Many biologically important molecules are linear arrangements of sub- 
units and can therefore be characterized as sequences. For example, a protein 
consists of amino acid residues linked by peptide bonds in a specific order 
known as its primary structure. A protein can alternatively be character- 
ized as a sequence of larger subunits called secondary structures. Yet another 
characterization of a protein is its tertiary structure: the sequence of spatial 
positions and orientations taken by each of its amino acid residues. In order to 
study the structural, functional, and evolutionary relationships amongst bio- 
logically similar molecules, it is often useful to first align the corresponding 
sequences. 

There are various forms of sequence alignment. Alignments can be made 
between sequences of the same type (for example, between the primary 
structures of proteins) or between sequences of different type (for example, 
alignment of a DNA sequence to a protein sequence, or of a protein to a 







three-dimensional structure). Pairwise alignment involves only two sequences, 
whereas multiple sequence alignment involves more than two sequences (al- 
though the term sometimes encompasses pairwise alignment also). Global 
alignment aligns whole sequences, whereas local alignment aligns only parts 
of sequences. In this section we consider only pairwise global alignments. 

Algorithms for sequence alignment have been studied extensively. The
inaugural paper on the subject is [122] and a useful reference is [74]. Other
examples (see also [128]) include the dynamic programming approach initiated
by [122], the polyhedral approach initiated by [97], the hidden Markov
model approach [104], and the Gibbs sampler approach [105]. The latter two
involve randomized optimization algorithms.

Alignments 

Consider two sequences of numbers $1, \ldots, n_1$ and $1, \ldots, n_2$. An alignment is an
arrangement of the two sequences into two stacked rows, possibly including
“spaces.” Two opposite spaces are not allowed. An example for $n_1 = 10$ and
$n_2 = 6$ is given in Table 7.1.



Table 7.1. An example of an alignment.

1  2  -  3  4  5  6  7  8  9  10

1  -  2  3  4  -  -  5  6  -  -



The significance of an alignment becomes clear when we assign a meaning 
to it. In particular, the two sequences of numbers could be associated with 
the positions of characters in a DNA or protein sequence. For example, the 
sequence 1, 2, . . . , 10 could be associated with the DNA string AGTGCAGATA, 
and the sequence 1,2, ...,6 with another DNA string ACTGGA. Each align- 
ment of the sequences 1, 2, . . . , 10 and 1, 2, . . . , 6 corresponds therefore with 
an “alignment” of the DNA strings. For example the alignment in Table 7.1 
corresponds to the string alignment in Table 7.2. 

Table 7.2. String alignment corresponding to the alignment in Table 7.1.

AG-TGCAGATA

A-CTG--GA--



To be useful in applications, an alignment of two sequences should reflect in
some way the commonalities of the sequences. Some alignments are therefore 
better than others. This concept is formalized using a performance function 
(often called a scoring function) to assign a value to an alignment. The most 







elementary way to do this is to use the edit distance between the aligned 
strings. In this case the performance of an alignment is derived from the 
aligned strings such that each character mismatch (for example A opposite to 
C) increases the performance by 1 and each insertion (space) also increases 
the performance by one. The objective is to find an alignment for which the 
edit distance is minimal. Table 7.3 gives two examples. Note that in this case 
the minimal edit distance is 5. 



Table 7.3. Two possible alignments, with the corresponding edit distance.



AGT-GCAGATA
ACTG--G-A--
Performance: 8



AG-TGCAGATA
A-CTG--GA--
Performance: 6
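The edit-distance performance is easy to compute, both for a given alignment and in its minimal form. The sketch below is illustrative only: the function names are hypothetical, and the dynamic programming recursion is the standard textbook one rather than code taken from the references. It reproduces the two performances of Table 7.3 and the minimal edit distance of 5 quoted above.

```python
def alignment_performance(row1, row2, gap="-"):
    """Edit-distance score of two aligned rows: each mismatch and each space costs 1."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(row1, row2):
        if a == gap and b == gap:
            raise ValueError("two opposite spaces are not allowed")
        if a != b:                      # covers mismatches and single spaces
            score += 1
    return score

def minimal_edit_distance(s1, s2):
    """Minimal performance over all alignments, by dynamic programming."""
    prev = list(range(len(s2) + 1))     # aligning the empty prefix of s1
    for i, a in enumerate(s1, start=1):
        curr = [i]
        for j, b in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,              # space opposite a
                            curr[j - 1] + 1,          # space opposite b
                            prev[j - 1] + (a != b)))  # match or mismatch
        prev = curr
    return prev[-1]

print(alignment_performance("AGT-GCAGATA", "ACTG--G-A--"))   # 8
print(alignment_performance("AG-TGCAGATA", "A-CTG--GA--"))   # 6
print(minimal_edit_distance("AGTGCAGATA", "ACTGGA"))         # 5
```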



Alignment Graph, Alignment Vector 

A convenient way of looking at alignments is to associate with each
alignment a unique alignment path through an alignment graph. An example of an
alignment graph for the sequences $1, \ldots, 10$ and $1, \ldots, 6$ is given in Figure 7.1.
The alignment path corresponding to Table 7.1 is also depicted. In particular,
starting at $(0, 0)$ the path jumps “to the right” whenever a space is inserted
in the second sequence, and it jumps “downwards” whenever a space is inserted
in the first sequence; otherwise it jumps diagonally. Note that each path ends at
$(n_1, n_2)$.



Fig. 7.1. Alignment graph. Directed edges run from $(i, j)$ to $(i + 1, j)$, $(i + 1, j + 1)$, and
$(i, j + 1)$. (The horizontal axis is labeled 0, 1, ..., 10.)



We can characterize the alignment path also by its alignment vector
$\mathbf{x} = (x_1, \ldots, x_r)$, which is a vector of 0s, 1s, and 2s, where $x_i$ denotes the
“direction” the path takes at the $i$-th traversed node: 0 = horizontal, 1 =
diagonal, and 2 = vertical. As an example, the alignment in Table 7.1
corresponds to the path in Figure 7.1 and the alignment vector of length $r = 11$
in Table 7.4.



Table 7.4. Each alignment vector corresponds to an alignment.

$\mathbf{x}$    1  0  2  1  1  0  0  1  1  0  0

     1  2  -  3  4  5  6  7  8  9  10

     1  -  2  3  4  -  -  5  6  -  -
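The correspondence between alignment vectors and alignments can be made concrete with a short routine. This is a minimal sketch with a hypothetical function name; it assumes the coding of Table 7.4 (0 = horizontal, 1 = diagonal, 2 = vertical), under which a horizontal step places a space in the second row and a vertical step places a space in the first row.

```python
def alignment_from_vector(x, s1, s2, gap="-"):
    """Turn an alignment vector x into the two aligned rows for strings s1 and s2."""
    i = j = 0                            # current node (i, j) in the alignment graph
    row1, row2 = [], []
    for step in x:
        if step == 0:                    # horizontal: next character of s1 opposite a space
            row1.append(s1[i]); row2.append(gap); i += 1
        elif step == 1:                  # diagonal: characters of s1 and s2 opposite each other
            row1.append(s1[i]); row2.append(s2[j]); i += 1; j += 1
        elif step == 2:                  # vertical: a space opposite the next character of s2
            row1.append(gap); row2.append(s2[j]); j += 1
        else:
            raise ValueError("steps must be 0, 1, or 2")
    if (i, j) != (len(s1), len(s2)):
        raise ValueError("the path does not end at (n1, n2)")
    return "".join(row1), "".join(row2)

rows = alignment_from_vector([1, 0, 2, 1, 1, 0, 0, 1, 1, 0, 0], "AGTGCAGATA", "ACTGGA")
print(rows)   # ('AG-TGCAGATA', 'A-CTG--GA--')
```

Applied to the vector of Table 7.4 and the DNA strings used earlier, the routine returns the string alignment of Table 7.2.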



Since there is a one-to-one correspondence between alignments and
alignment paths it should not cause confusion if the same symbol $\mathbf{x}$ is used to
represent both objects, and the same symbol $\mathcal{X}$ is used to represent the
corresponding spaces. Moreover, any performance function for alignments may be
regarded as a performance function for alignment paths. Now, let $S(\mathbf{x})$ denote
the performance of the alignment corresponding to alignment vector $\mathbf{x}$. An
optimal global sequence alignment is then an alignment $\mathbf{x}$ which solves

$$\min_{\mathbf{x} \in \mathcal{X}} S(\mathbf{x}) . \tag{7.1}$$

Remark 7.1. When using the edit distance, the performance depends in an 
“additive” way on the path. As a consequence, the optimal performance can 
efficiently be determined by dynamic programming (DP) [122, 161, 74, 84]. In 
practice, more complicated performance/scoring functions are used, but DP 
still works in many cases. When the performance function depends on the 
whole path, then the COP (7.1) is NP-complete. In particular, this holds for 
structure alignments, where the alignment corresponds to the spatial position 
of a protein residue (character). For such problems the CE method can provide 
a viable alternative. Moreover, deterministic algorithms only give one possible 
alignment and say nothing about the distribution of optimal alignments. The 
CE method can address this issue without much effort. 



Trajectory Generation and Updating Formulas for CE 

Using the edit distance, the optimal sequence alignment problem can be
formulated in terms of a shortest path problem through the alignment graph; see
Figure 7.1. The shortest path corresponds to the minimal edit distance. Thus,
in the formulation of (7.1), we need to find an alignment path $\mathbf{x}$ through the
alignment graph for which the alignment score $S(\mathbf{x})$ is minimal. We wish to
solve (7.1) using the CE method. As usual, we need to specify:

1. How we generate the trajectories in $\mathcal{X}$ (that is, alignment vectors),

2. The probability updating formulas.







We generate our trajectories $\mathbf{x} \in \mathcal{X}$ by running a Markov chain $\mathbf{X} =
(X_1, X_2, \ldots)$ on the alignment graph, starting at $(0,0)$ and ending at
$(n_1, n_2)$, with the one-step transition probabilities in Table 7.5 (for
all $0 \leq i \leq n_1 - 1$ and $0 \leq j \leq n_2 - 1$):



Table 7.5. The Transition Probabilities.

from       to                  with prob.
$(i, j)$     $(i + 1, j)$          $r(i, j)$
$(i, j)$     $(i, j + 1)$          $d(i, j)$
$(i, j)$     $(i + 1, j + 1)$      $1 - r(i, j) - d(i, j)$



(Note that here $r$ stands for right and $d$ for down.) Moreover, for $j =
0, \ldots, n_2 - 1$ the transition probability from $(n_1, j)$ to $(n_1, j + 1)$ is 1. Similarly,
for $i = 0, \ldots, n_1 - 1$ the transition probability from $(i, n_2)$ to $(i + 1, n_2)$ is 1.
Finally, $(n_1, n_2)$ is an absorbing state.
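The trajectory generation just described can be sketched as follows. The function name and the representation of $r$ and $d$ as nested lists indexed by `[i][j]` are assumptions made for illustration; the step coding (0 = right, 1 = diagonal, 2 = down) follows the alignment vector convention of Table 7.4.

```python
import random

def generate_alignment_vector(r, d, n1, n2, rng):
    """Run the Markov chain from (0, 0) to (n1, n2) and return the alignment vector."""
    i = j = 0
    x = []
    while (i, j) != (n1, n2):            # (n1, n2) is absorbing
        if i == n1:
            step = 2                     # right border: move down with probability 1
        elif j == n2:
            step = 0                     # bottom border: move right with probability 1
        else:
            u = rng.random()
            if u < r[i][j]:
                step = 0                 # right, probability r(i, j)
            elif u < r[i][j] + d[i][j]:
                step = 2                 # down, probability d(i, j)
            else:
                step = 1                 # diagonal, probability 1 - r(i, j) - d(i, j)
        x.append(step)
        if step in (0, 1):
            i += 1
        if step in (1, 2):
            j += 1
    return x

n1, n2 = 10, 6
uniform = [[1.0 / 3.0] * n2 for _ in range(n1)]
path = generate_alignment_vector(uniform, uniform, n1, n2, random.Random(1))
```

Every generated vector contains exactly $n_1$ steps that advance $i$ and $n_2$ steps that advance $j$, so its length lies between $\max(n_1, n_2)$ and $n_1 + n_2$.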

Parameter Update 

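Let us gather all parameters $\{r(i, j), d(i, j),\ 0 \leq i \leq n_1 - 1,\ 0 \leq j \leq n_2 - 1\}$
into a single parameter vector $\mathbf{v}$. For each such $\mathbf{v}$ we have

$$f(\mathbf{x};\mathbf{v}) = \mathbb{P}_{\mathbf{v}}(\mathbf{X} = \mathbf{x}) = \prod_{i=0}^{n_1-1}\prod_{j=0}^{n_2-1}
r(i,j)^{I_{\mathcal{X}_r(i,j)}(\mathbf{x})}\; d(i,j)^{I_{\mathcal{X}_d(i,j)}(\mathbf{x})}\;
\bigl(1 - r(i,j) - d(i,j)\bigr)^{I_{\mathcal{X}(i,j) - \mathcal{X}_r(i,j) - \mathcal{X}_d(i,j)}(\mathbf{x})} . \tag{7.2}$$

Here, $\mathcal{X}_r(i,j)$ is the set of all paths $\mathbf{x}$ making the transition from $(i,j)$ to
$(i+1, j)$; similarly, $\mathcal{X}_d(i,j)$ is the set of all paths $\mathbf{x}$ making the transition from
$(i, j)$ to $(i, j+1)$, and $\mathcal{X}(i,j)$ is the set of all paths passing through $(i,j)$, so that
$\mathcal{X}(i,j) - \mathcal{X}_r(i,j) - \mathcal{X}_d(i,j)$ is the set of all paths making the diagonal
transition from $(i,j)$ to $(i+1, j+1)$. It follows that

$$\mathbb{E}_{\mathbf{u}} I_{\{S(\mathbf{X}) \leq \gamma\}} \ln f(\mathbf{X};\mathbf{v}) =
\mathbb{E}_{\mathbf{u}} I_{\{S(\mathbf{X}) \leq \gamma\}} \sum_{i=0}^{n_1-1}\sum_{j=0}^{n_2-1}
\Bigl( \ln(r(i,j))\, I_{\mathcal{X}_r(i,j)}(\mathbf{X}) + \ln(d(i,j))\, I_{\mathcal{X}_d(i,j)}(\mathbf{X})
+ \ln(1 - r(i,j) - d(i,j))\, I_{\mathcal{X}(i,j) - \mathcal{X}_r(i,j) - \mathcal{X}_d(i,j)}(\mathbf{X}) \Bigr) . \tag{7.3}$$

In view of the fundamental CE formula (4.7) we wish to maximize (7.3) with
respect to $r(i,j)$ and $d(i,j)$ for all $i$ and $j$, which amounts to differentiating
the expression above with respect to $r(i,j)$ and $d(i,j)$ and equating it to zero.
In the usual way this gives the deterministic updating formulas

$$r(i,j) = \frac{\mathbb{E}_{\mathbf{u}} I_{\{S(\mathbf{X}) \leq \gamma\}}\, I_{\{\mathbf{X} \in \mathcal{X}_r(i,j)\}}}
{\mathbb{E}_{\mathbf{u}} I_{\{S(\mathbf{X}) \leq \gamma\}}\, I_{\{\mathbf{X} \in \mathcal{X}(i,j)\}}} \tag{7.4}$$

and

$$d(i,j) = \frac{\mathbb{E}_{\mathbf{u}} I_{\{S(\mathbf{X}) \leq \gamma\}}\, I_{\{\mathbf{X} \in \mathcal{X}_d(i,j)\}}}
{\mathbb{E}_{\mathbf{u}} I_{\{S(\mathbf{X}) \leq \gamma\}}\, I_{\{\mathbf{X} \in \mathcal{X}(i,j)\}}} . \tag{7.5}$$

In a similar way we derive the formulas for the estimators $\widehat{r}_t(i,j)$ and $\widehat{d}_t(i,j)$
at stage $t$ of Algorithm 4.2.1 as

$$\widehat{r}_t(i,j) = \frac{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \leq \widehat{\gamma}_t\}}\, I_{\{\mathbf{X}_k \in \mathcal{X}_r(i,j)\}}}
{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \leq \widehat{\gamma}_t\}}\, I_{\{\mathbf{X}_k \in \mathcal{X}(i,j)\}}} \tag{7.6}$$

and

$$\widehat{d}_t(i,j) = \frac{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \leq \widehat{\gamma}_t\}}\, I_{\{\mathbf{X}_k \in \mathcal{X}_d(i,j)\}}}
{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \leq \widehat{\gamma}_t\}}\, I_{\{\mathbf{X}_k \in \mathcal{X}(i,j)\}}} , \tag{7.7}$$

where $\mathbf{X}_1, \ldots, \mathbf{X}_N$ are independent alignment paths generated according to
the transition probabilities $\widehat{r}_{t-1}(i,j)$ and $\widehat{d}_{t-1}(i,j)$. These estimators have an
easy interpretation. For example, to obtain $\widehat{r}_t(i,j)$ we count the number of
paths (out of $N$) going from $(i, j)$ to $(i + 1, j)$ that have a performance less
than or equal to $\widehat{\gamma}_t$, and divide this number by the total number of paths
passing through $(i, j)$ that have a performance less than or equal to $\widehat{\gamma}_t$. The
estimator for $d(i,j)$ has a similar natural interpretation. The CE algorithm
for sequence alignment follows directly from Algorithm 4.2.1 and is given
below for convenience.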



Algorithm 7.1.1 (Sequence Alignment CE Algorithm)

1. Choose some initial vector of transition probabilities $\widehat{\mathbf{v}}_0$; for example with
$\widehat{r}_0(i,j) = \widehat{d}_0(i,j) = 1/3$. Set $t = 1$.

2. Generate $N$ paths $\mathbf{X}_1, \ldots, \mathbf{X}_N$ of the Markov process $\mathbf{X}$, using the
transition probabilities specified in $\widehat{\mathbf{v}}_{t-1}$, and compute the sample $\varrho$-quantile $\widehat{\gamma}_t$ of
the performances.

3. Using the same sample $\mathbf{X}_1, \ldots, \mathbf{X}_N$ update the transition probabilities via
(7.6) and (7.7), for each $(i,j)$.

4. If for some $t \geq d$, say $d = 5$, stopping criterion (4.10) is met then stop;
otherwise set $t = t + 1$ and reiterate from Step 2.
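The four steps above can be sketched compactly in Python for the edit-distance performance. This is an illustrative sketch, not the authors' implementation, and it rests on several assumptions: all function and parameter names are hypothetical, no smoothing is applied, nodes that no elite path visits keep their previous probabilities, and the stopping criterion is taken to mean that $\widehat{\gamma}_t$ has not changed over the last `d_stop` iterations.

```python
import random

def sample_path(r, d, n1, n2, rng):
    """Run the Markov chain from (0, 0) to (n1, n2); return its (i, j, step) moves."""
    i = j = 0
    moves = []
    while (i, j) != (n1, n2):
        if i == n1:
            step = 2                                  # forced move down
        elif j == n2:
            step = 0                                  # forced move right
        else:
            u = rng.random()
            step = 0 if u < r[i][j] else (2 if u < r[i][j] + d[i][j] else 1)
        moves.append((i, j, step))
        i += step != 2                                # steps 0 and 1 advance i
        j += step != 0                                # steps 1 and 2 advance j
    return moves

def edit_performance(moves, s1, s2):
    """Each space (step 0 or 2) and each diagonal mismatch costs 1."""
    return sum(1 for i, j, step in moves if step != 1 or s1[i] != s2[j])

def ce_align(s1, s2, N=1000, rho=0.1, d_stop=5, max_iter=200, seed=0):
    rng = random.Random(seed)
    n1, n2 = len(s1), len(s2)
    r = [[1.0 / 3.0] * n2 for _ in range(n1)]         # Step 1: r_0 = d_0 = 1/3
    d = [[1.0 / 3.0] * n2 for _ in range(n1)]
    history = []
    for t in range(1, max_iter + 1):
        paths = [sample_path(r, d, n1, n2, rng) for _ in range(N)]      # Step 2
        scores = [edit_performance(p, s1, s2) for p in paths]
        gamma = sorted(scores)[max(int(rho * N) - 1, 0)]                # sample rho-quantile
        history.append(gamma)
        through = [[0] * n2 for _ in range(n1)]
        right = [[0] * n2 for _ in range(n1)]
        down = [[0] * n2 for _ in range(n1)]
        for p, s in zip(paths, scores):                                 # Step 3: (7.6), (7.7)
            if s <= gamma:
                for i, j, step in p:
                    if i < n1 and j < n2:             # border nodes have no parameters
                        through[i][j] += 1
                        right[i][j] += step == 0
                        down[i][j] += step == 2
        for i in range(n1):
            for j in range(n2):
                if through[i][j] > 0:                 # otherwise keep the old values
                    r[i][j] = right[i][j] / through[i][j]
                    d[i][j] = down[i][j] / through[i][j]
        if t >= d_stop and len(set(history[-d_stop:])) == 1:            # Step 4
            break
    return gamma, history

gamma, history = ce_align("AGTGCAGATA", "ACTGGA", N=500)
```

Because no smoothing is used, the update can fix some transition probabilities at 0 or 1 early on, so the returned `gamma` is an estimate of the minimal edit distance and is not guaranteed to equal it.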



Remark 7.2 (Modifications). Various modifications of Algorithm 7.1.1 are
possible. In particular, we mention the modification in [98] where $\mathbf{X}$ is allowed
to start at any position along the upper border $\{(0, j),\ j = 0, 1, \ldots, n_2\}$ or
the left border $\{(i, 0),\ i = 0, 1, \ldots, n_1\}$. The updating formulas for the starting
probabilities are similar to the ones for the transition probabilities. The
advantage of the latter approach is that a larger part of the alignment graph will be
explored at an early stage in the procedure. As a consequence a considerable
speed-up of the algorithm can be achieved.







Example 7.3. As an illustration, we provide the outcome of Algorithm 7.1.1
when applied to two protein sequences from Escherichia coli: Nitrogen
Regulatory Protein P-II 1 (database: gil21386) and Nitrogen Regulatory Protein P-II 2
(database: gil707971); see [34]. The two protein sequences are shown below:



MKKIDAIIKPFKLDDVREALAEVGITGMTVTEVKGFGRQKGHTELYRGAEYMVDFLPK 

VKIERTAQTGKIGDGKIFVFDVARVIRIRTGEEDDAAI 

MKLVTVIIKPFKLEDVREALSSIGIQGLTVTEVKGFGRQKGHAELYRGAEYSVNFLPK 

VKIDVAIADDQLDEVIDIVSKAAYTGKIGDGKIFVAELQRVIRIRTGEADEAAL 



The CE algorithm was implemented with a smoothing parameter $\alpha = 0.9$
and an adaptive updating/selection of the parameters such as in the FACE
modification of Section 5.3, with $\varrho_0 = 0.01$ and $N_0 = 40000$. The algorithm
found the following alignment:

MKKIDAIIKPFKLDDVREALAEVGITGMTVTEVKGFGRQKGHTELYRGAEYMVDFLPKVKI 

MKLVTVIIKPFKLEDVREALSSIGIQGLTVTEVKGFGRQKGHAELYRGAEYSVNFLPKVKI 

E R__TAQ_TGKIGDGKIFVFDVARVIRIRTGEEDDAAI 

DVAIADDQLDEVIDIVSKAAYTGKIGDGKIFVAELQRVIRIRTGEADEAAL 



which gives an optimal performance of 39. Note that the edit distance does not
impose gap penalties. A more sophisticated performance function would favor
a single big gap over multiple small gaps. A typical evolution of the algorithm
is given in Table 7.6. The CPU time was 94 seconds.



Table 7.6. Typical evolution of the sequence alignment CE algorithm.

$t$              $\widehat{\gamma}_t$
1      133    146
2      106    125
3       89    102
4       63     77
5       49     59
6       42     48
7       40     42
8       39     40
9       39     39
10      39     39
11      39     39



7.2 Single Machine Total Weighted Tardiness Problem 

The single machine total weighted tardiness problem (SMTWTP) is a schedul- 
ing problem with the following description. 

A set $V = \{1, \ldots, n\}$ of jobs with their corresponding processing times
$\{L_1, \ldots, L_n\}$ is given. A due date $d_j$ and a weight $w_j$ (denoting the cost
of one time unit of tardiness) are associated with each job $j$. The aim is to
schedule the jobs without interruption on a single machine that can handle
no more than one job at a time, so that the total cost is minimized.

Mathematically, the SMTWTP can be formulated as follows: Let $\mathcal{X}$ be
the set of all the permutations of $V$. Each permutation $\mathbf{x} = (x_1, \ldots, x_n) \in \mathcal{X}$
defines the order in which the jobs are processed. Thus, for a 5-job problem
$\mathbf{x} = (3, 1, 2, 5, 4)$ means that job 3 is processed first, then job 1, then 2, etc.
Define $f_k$ as the finishing date of the job that is assigned to place $k$. This
depends of course on the schedule $\mathbf{x}$. Specifically,

$$f_k = \sum_{l=1}^{k} L_{x_l} . \tag{7.8}$$

So the SMTWTP is the problem of finding the solution of the program

$$\min_{\mathbf{x} \in \mathcal{X}} S(\mathbf{x}) = \min_{\mathbf{x} \in \mathcal{X}} \sum_{k=1}^{n} w_{x_k} \max\{ f_k - d_{x_k}, 0\} , \tag{7.9}$$

where $f_k$ is defined by (7.8). Several techniques have been applied to the
SMTWTP: branch and bound algorithms, simulated annealing, tabu search,
genetic algorithms, and ant colony optimization; see for example [131, 3, 40,
27]. We apply here, of course, the CE algorithm.

We can generate the permutations $\mathbf{x}$ either according to Algorithm 4.7.1
or according to Algorithm 4.7.2 with the associated matrices $P = (p_{ij})$ and
$P = (p_{(i,j)})$ (as in (4.39)), respectively. It turns out that with the second
representation the method works better.

To illustrate the main steps of the CE algorithm for the SMTWTP consider 
the following simple example. 

Example 7.4 (SMTWTP with 3 Jobs). Consider the single machine total 
weighted tardiness problem with 3 jobs and assume the following parame- 
ters are given: 



Table 7.7. Example parameters for the SMTWTP.

$j$    $L_j$    $d_j$    $w_j$
1    1     1      1
2    1     2      1
3    1     2.5    1





Let $P_t$ be the matrix corresponding to stage $t$ of Algorithm 4.7.2. Thus,

$$P_t = \begin{pmatrix}
p_{t,(1,1)} & p_{t,(1,2)} & p_{t,(1,3)} \\
p_{t,(2,1)} & p_{t,(2,2)} & p_{t,(2,3)} \\
p_{t,(3,1)} & p_{t,(3,2)} & p_{t,(3,3)}
\end{pmatrix} , \tag{7.10}$$

where $p_{t,(i,j)}$ corresponds to the arrangement of job $i$ to place $j$.
Assume that $P_0$ is the matrix with equal probabilities 1/3. The six possible
schedules (permutations) are: (1,2,3); (1,3,2); (2,1,3); (2,3,1); (3,1,2); (3,2,1).
Because of the equal probabilities in $P_0$, each of these 6 solutions is
generated initially with probability 1/6. It is easily verified that the corresponding
objective values are

$S((1, 2, 3)) = 0.5$ ; $S((1, 3, 2)) = 1.0$ ; $S((2, 1, 3)) = 1.5$ ;

$S((2, 3, 1)) = 2.0$ ; $S((3, 1, 2)) = 2.0$ ; $S((3, 2, 1)) = 2.0$ .
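These six objective values follow directly from (7.8) and (7.9) and the data of Table 7.7. A minimal sketch, in which the function name and the use of dictionaries keyed by job number are illustrative choices:

```python
from itertools import permutations

def total_weighted_tardiness(x, L, d, w):
    """Evaluate S(x) of (7.9) for the schedule x (a sequence of job labels)."""
    finish = 0.0                       # f_k of (7.8), accumulated place by place
    total = 0.0
    for job in x:
        finish += L[job]
        total += w[job] * max(finish - d[job], 0.0)
    return total

# Data of Table 7.7
L = {1: 1.0, 2: 1.0, 3: 1.0}
d = {1: 1.0, 2: 2.0, 3: 2.5}
w = {1: 1.0, 2: 1.0, 3: 1.0}

for x in permutations((1, 2, 3)):
    print(x, total_weighted_tardiness(x, L, d, w))
```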

Assume next that q = 1/3. It follows that the (1 — ^)-quantile of the 
performances 71 is equal to 1.0. Therefore, 

•^{S((1,2,3))<7i} = 1 ; -^{S((1,3,2))<7i} = 1 ; -^{S((2,l,3))<7i} = 0 ! 

■f{S((2,3,l))<7i} = 0 ; -f{S((3,l,2))<7i} = 0 i -f{S((3,2,l))<7i} = ^ ’ 

In the first iteration we have from the deterministic version of (4.41) (note 
that we use here X and S instead of Y and 5, because there is no need to 
distinguish the node placement from the node transition algorithm here): 

EpqJ{s(X)< 7 i,Xi=i} 2/6 

Ep„/{5(X)<7i} 2/6 ’ 

and similarly Pi,( 2 , 2 ) = Pi, ( 2 , 3 ) =Pi,( 3 , 2 ) =Pi,( 3 , 3 ) = 1/2, and for all other i,j 
we have = 0, so that 



1 0 0 \ 

Pi = I 0 1/2 1/2 . 

Vo 1/2 1/2/ 

Using Pi we can generate only two solutions: 

(1,2,3) and (1,3,2). 

Obviously, each of the above solutions is generated with probability 1 / 2 . Not- 
ing that Q = 1/3, it follows that 72 = 0.5. Therefore, 



-^{S((1,2,3)<72} = 1 ; 1’{S((1,3,2)<72} = 0 , 

and the probabilities for the second iteration are given by 




236 7 Applications of CE to COPs 



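\[
P_2 = \begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix}.
\]

For example,

\[
p_{2,(3,3)} = \frac{E_{P_1} I_{\{S(X) \le \gamma_2,\, X_3 = 3\}}}{E_{P_1} I_{\{S(X) \le \gamma_2\}}} = \frac{1/2}{1/2} = 1.
\]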

Thus, the CE method indeed has converged to the optimal solution (1, 2, 3). 
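
The two deterministic iterations of this toy example can be checked by direct enumeration. The following Python sketch is illustrative only: the function names (`S`, `prob`, `ce_step`) are hypothetical, and the way a schedule is generated from the matrix (jobs placed one at a time, each choosing among the still-free places in proportion to its row) is an assumed reading of the node placement algorithm, not a transcription of Algorithm 4.7.2. Exact rational arithmetic is used so that the quantile comparison is not affected by rounding.

```python
from fractions import Fraction
from itertools import permutations

# Data of Table 7.7 (jobs 1..3): processing times, due dates, weights.
L = {1: Fraction(1), 2: Fraction(1), 3: Fraction(1)}
d = {1: Fraction(1), 2: Fraction(2), 3: Fraction(5, 2)}
w = {1: Fraction(1), 2: Fraction(1), 3: Fraction(1)}
n = 3


def S(x):
    """Total weighted tardiness of the schedule x = (x_1, ..., x_n)."""
    finish, total = Fraction(0), Fraction(0)
    for job in x:
        finish += L[job]
        total += w[job] * max(finish - d[job], Fraction(0))
    return total


def prob(x, P):
    """Probability of schedule x when jobs 1..n are placed one by one,
    job i choosing a free place in proportion to row i of P (assumption)."""
    place = {job: k for k, job in enumerate(x)}  # job -> 0-based place
    free, p = set(range(n)), Fraction(1)
    for job in range(1, n + 1):
        mass = sum(P[job - 1][k] for k in free)
        if mass == 0:
            return Fraction(0)
        p *= P[job - 1][place[job]] / mass
        free.remove(place[job])
    return p


def ce_step(P, rho):
    """One deterministic CE update; returns (gamma, updated matrix)."""
    perms = list(permutations(range(1, n + 1)))
    dist = {x: prob(x, P) for x in perms}
    # gamma: smallest performance whose cumulative probability reaches rho
    cum = Fraction(0)
    for x in sorted(perms, key=S):
        cum += dist[x]
        if cum >= rho:
            gamma = S(x)
            break
    elite = sum(dist[x] for x in perms if S(x) <= gamma)
    new_P = [[sum(dist[x] for x in perms
                  if S(x) <= gamma and x[j] == i + 1) / elite
              for j in range(n)] for i in range(n)]
    return gamma, new_P


P0 = [[Fraction(1, 3)] * n for _ in range(n)]
gamma1, P1 = ce_step(P0, Fraction(1, 3))
gamma2, P2 = ce_step(P1, Fraction(1, 3))
print(gamma1, P1)
print(gamma2, P2)
```

Under these assumptions the enumeration reproduces the thresholds and matrices derived above; for larger instances the expectations would of course be replaced by sample averages.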

Below we present numerical results with the CE Algorithm 4.2.1 for the SMTWTP for the set of benchmark problems taken from the site

http://mscmga.ms.ic.ac.uk/jeb/orlib/wtinfo.html

The set contains 5 problems with 40 jobs, 5 problems with 50 jobs, and 5 problems with 100 jobs. Table 7.8 presents the performance of the algorithm using the following parameters: $N = 10 n^2$, $\varrho = 0.05$, $\alpha = 0.5$. The data were averaged over 10 independent replications. In Table 7.8, $T$ denotes the average total number of iterations needed before stopping, $\gamma_1$ and $\gamma_T$ denote, respectively, the average of the initial and the final estimates of the optimal solution, $\gamma^*$ denotes the best known solution, $\varepsilon$ denotes the average relative experimental error, $\varepsilon_*$ and $\varepsilon^*$ denote the best and the worst relative experimental error among the 10 generated optimal solutions, and finally CPU denotes the average CPU time in seconds.



Table 7.8. The average performance of Algorithm 4.2.1 for the SMTWTP based
on 10 independent replications.

Probl.  n     T      $\gamma_1$  $\gamma_T$  $\gamma^*$  $\varepsilon$  $\varepsilon_*$  $\varepsilon^*$  CPU
1       40    21.4    3091.4      926.6       913.0      0.015          0.000            0.019              47
2       40    26.2    2255.0     1240.6      1225.0      0.013          0.001            0.022              56
3       40    18.4    2243.8      573.0       537.0      0.067          0.067            0.067              43
4       40    22.8    3269.8     2107.2      2094.0      0.006          0.006            0.009              49
5       40    11.0    2044.8      990.0       990.0      0.000          0.000            0.000              23
1       50    18.8    3774.4     2134.0      2134.0      0.000          0.000            0.000              88
2       50    18.0    3929.8     1998.0      1996.0      0.001          0.001            0.001              93
3       50    15.2    4631.2     2583.0      2583.0      0.000          0.000            0.000              71
4       50    14.0    5500.2     2691.0      2691.0      0.000          0.000            0.000              66
5       50    20.8    4700.2     1580.4      1518.0      0.041          0.023            0.057              97
1       100   34.4   25387.6     5988.0      5988.0      0.000          0.000            0.000            2111
2       100   32.0   24230.8     6304.2      6170.0      0.022          0.015            0.024            1947
3       100   33.4   19855.2     4311.0      4267.0      0.010          0.008            0.013            2011
4       100   53.6   19460.0     5042.8      5011.0      0.006          0.002            0.009            3320
5       100   44.4   21333.6     5283.0      5283.0      0.000          0.000            0.000            2743





Table 7.9 presents a typical evolution of the CE Algorithm for the SMTWTP with the indicator function for benchmark Problem 1 with 50 jobs. The parameters were as given above, in particular $N = 25000$. In the table, $t$ is the iteration number, the second column gives the best solution obtained at the $t$-th iteration, $\gamma_t$ is the worst solution of the elite samples, and $P_t^{mm} = \min_r \max_s p_{t,(r,s)}$.

Table 7.9. Typical evolution of the FACE algorithm for the SMTWTP.

t     best    $\gamma_t$   $P_t^{mm}$
1     3392    6850         0.030
2     2669    4482         0.035
3     2452    3224         0.080
4     2243    2621         0.121
5     2163    2392         0.148
6     2163    2340         0.159
7     2134    2335         0.201
8     2134    2247         0.306
9     2134    2184         0.286
10    2134    2134         0.342
11    2134    2134         0.428
12    2134    2134         0.512
13    2134    2134         0.631



7.3 Single Machine Common Due Date Problem 



The description of the single machine common due date problem (SMCDDP) is similar to that of the SMTWTP in the previous section. Again, we are given a set $V = \{1, \ldots, n\}$ of jobs and a set of corresponding processing times $\{L_1, \ldots, L_n\}$. As before, the jobs have to be processed on one machine. However, now all the jobs have a common due date $d$. For each job $j$ an earliness penalty $a_j$ and a tardiness penalty $b_j$ are given. These are incurred if the job is finished before or after the common due date $d$, respectively. The goal is to find a schedule for the $n$ jobs that minimizes the sum of earliness and tardiness penalties.

As for the SMTWTP, $\mathscr{X}$ denotes the set of all the permutations of $V$; a permutation $x = (x_1, \ldots, x_n)$ corresponds to the order in which the jobs are processed. Let $f_k$, as in (7.8), denote the finishing time of the job which is assigned to place $k$ in the processing order. The program to be solved is:



\[
\min_{x \in \mathscr{X}} S(x) = \min_{x \in \mathscr{X}} \sum_{k=1}^{n} \Big( a_{x_k} \max\{d - f_k, 0\} + b_{x_k} \max\{f_k - d, 0\} \Big). \tag{7.11}
\]
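
As a minimal sketch of how the SMCDDP performance of a schedule might be evaluated, the Python function below accumulates finishing times and penalties along the processing order. It assumes that processing starts at time 0 with no idle time between jobs, so that the finishing time of the job in place $k$ is the running sum of processing times; the function name and the small data set are hypothetical and serve only as illustration.

```python
def smcddp_cost(x, L, a, b, d):
    """Sum of earliness and tardiness penalties of schedule x.

    x    : sequence of job labels in processing order
    L    : dict job -> processing time
    a, b : dicts job -> earliness / tardiness penalty per unit time
    d    : common due date
    Assumes the machine starts at time 0 and never idles.
    """
    finish, total = 0.0, 0.0
    for job in x:
        finish += L[job]  # finishing time of the job in this place
        total += a[job] * max(d - finish, 0.0)  # earliness part
        total += b[job] * max(finish - d, 0.0)  # tardiness part
    return total


# Hypothetical 3-job instance, used only to exercise the function.
L = {1: 2.0, 2: 1.0, 3: 3.0}
a = {1: 1.0, 2: 2.0, 3: 1.0}
b = {1: 3.0, 2: 1.0, 3: 2.0}
cost = smcddp_cost((2, 1, 3), L, a, b, d=3.0)
print(cost)
```

In a CE implementation such a function would simply replace the SMTWTP performance while the permutation generation stays unchanged.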







We can use the same permutation generation algorithms as in the TSP and SMTWTP. Numerical results for various benchmark problems from

http://mscmga.ms.ic.ac.uk/jeb/orlib/schinfo.html

are presented in [113].



7.4 Capacitated Vehicle Routing Problem 

The capacitated vehicle routing problem (CVRP) is the basic version of the more general vehicle routing problem (VRP), first proposed by Dantzig and Ramser [42]. The CVRP can be described as follows (see Figure 7.2 for an illustration): A set of $n$ customers, labeled $\{1, 2, \ldots, n\}$, must be served from a single depot, labeled 0. The transit cost from $i$ to $j$ is given by $c_{ij}$ for each $0 \le i, j \le n$. We assume that the cost structure is symmetric, that is, $c_{ij} = c_{ji}$. We also assume $c_{ii} = 0$. Each customer $i$ has a demand $d_i$ of goods and a vehicle of capacity $D$ is available to deliver goods. Since the vehicle capacity is limited, the vehicle has to periodically return to the depot for reloading. It is forbidden to split a customer delivery. The goal is to find the set of tours of minimal total cost, such that each tour begins and ends in the depot, each customer is served by exactly one tour and the total demand served on each tour is at most $D$. Various references on the (C)VRP may be found in [35] and [133].




To solve the CVRP using CE we use an algorithm similar to that for the TSP. In particular, we first assume that the graph is fully connected, assigning costs $\infty$ to “nonexisting” edges. We then generate tours $X$ in exactly the same way as for the TSP, by representing them as random permutations $(X_0, \ldots, X_n)$ of $(0, 1, \ldots, n)$. To see how a tour/permutation $x = (x_0, \ldots, x_n)$ and the $d_i$ and $D$ uniquely define a vehicle route, consider Figure 7.3. Here $x = (0, 1, \ldots, n)$ corresponds to the tour $0 \to 1 \to \cdots \to n \to 0$. This tour only visits the depot







(0) at the beginning and at the end. However, when at a certain node $k$ we have $d_1 + d_2 + \cdots + d_k \le D$ and $d_1 + d_2 + \cdots + d_k + d_{k+1} > D$, the vehicle must return to the depot between nodes $k$ and $k + 1$. Suppose that this is the case for $k = 4$. For the second part of the vehicle route we can likewise determine exactly when the vehicle needs to return to the depot, for example after having visited node 8. We can see that for a given demand structure $\{d_i\}$ and capacity $D$ the tour $x$ in Figure 7.3 could correspond uniquely to the vehicle route given in Figure 7.2.

The only difference with the TSP is in calculating the performance of a tour $x$, which is given by the total cost of the vehicle route determined by $x$. For example the performance of $x = (x_0, \ldots, x_n) = (0, 1, \ldots, n)$ in Figure 7.3 is given by

\[
S(x) = \Big( \sum_{i=0}^{n-1} c_{x_i, x_{i+1}} + c_{x_n, x_0} \Big) + c_{40} + c_{05} - c_{45} + c_{80} + c_{09} - c_{89},
\]

where the terms in brackets correspond to the usual TSP performance, as in (4.32).
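
The route cost implied by a permutation can be computed in a single pass over the tour. The sketch below is one possible Python rendering under stated assumptions: the function name `cvrp_cost` is hypothetical, every single demand is assumed not to exceed $D$, and the small instance (customers on a line, unit demands) is invented purely to exercise the code.

```python
def cvrp_cost(x, c, demand, D):
    """Cost of the vehicle route induced by the tour/permutation x.

    x      : permutation of (0, 1, ..., n); node 0 is the depot
    c      : symmetric cost matrix, c[i][j] for 0 <= i, j <= n
    demand : demand[i] for customers i = 1..n (index 0 unused)
    D      : vehicle capacity (each demand is assumed to be <= D)
    """
    k = x.index(0)
    order = list(x[k + 1:]) + list(x[:k])  # customers in visiting order
    cost, load, prev = 0.0, 0.0, 0
    for cust in order:
        if load + demand[cust] > D:  # capacity exceeded: reload at depot
            cost += c[prev][0]
            prev, load = 0, 0.0
        cost += c[prev][cust]
        load += demand[cust]
        prev = cust
    return cost + c[prev][0]  # final return to the depot


# Hypothetical instance: depot at 0 and customers at 1, 2, 3, 4 on a line.
pos = [0, 1, 2, 3, 4]
c = [[abs(p - q) for q in pos] for p in pos]
demand = [0, 1, 1, 1, 1]
print(cvrp_cost((0, 1, 2, 3, 4), c, demand, D=2))
```

With capacity 2 the vehicle returns to the depot after customer 2, so the cost equals the TSP tour length plus $c_{20} + c_{03} - c_{23}$, which mirrors the structure of the correction terms in the formula above.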







Fig. 7.3. Describing the routes via a tour /permutation. 



Various benchmark problems for the (symmetric) capacitated vehicle routing problem can be found at the following URL:

http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95

Table 7.10 presents the performance of Algorithm 4.2.1 for the CVRP. The parameters are: $N = 10 n^2$, $\varrho = 0.05$, $\alpha = 0.5$, and the data were averaged over 10 independent replications. In Table 7.10 we use the same notation as in Table 7.8.







Table 7.10. The average performance of Algorithm 4.2.1 for the CVRP based
on 10 independent replications.

name    n    T       $\gamma_1$  $\gamma_T$  $\gamma^*$  $\varepsilon$  $\varepsilon_*$  $\varepsilon^*$  CPU
eil22   22   39.0     611.5      377.4       375.0       0.006          0.001            0.012             24
eil23   23   30.5    1005.4      569.0       569.0       0.000          0.000            0.001             24
eil30   30   54.8    1095.0      515.4       -           -              -                -                127
eil33   33   78.0    1335.3      852.4       834.0       0.022          0.020            0.025            142
eil51   51   107.2   1352.4      530.0       -           -              -                -                980



Table 7.11 presents a typical evolution of the FACE Algorithm for the CVRP for benchmark problem eil22. The parameters are $N = 4840$, $\varrho = 0.05$, and $\alpha = 0.5$. The relative experimental error is $\varepsilon = 0.001$.

Table 7.11. Typical evolution of the FACE algorithm for the CVRP.

t     best     $\gamma_t$   $P_t^{mm}$
1     597.8    699.7        0.065
2     566.2    643.5        0.090
3     521.9    595.4        0.087
4     482.1    563.4        0.094
5     473.3    527.5        0.108
6     451.6    507.9        0.117
7     441.1    497.4        0.147
8     445.1    486.1        0.136
9     424.7    471.6        0.130
10    403.3    458.6        0.138
11    400.8    447.7        0.160
12    375.3    428.8        0.148
13    376.5    406.8        0.178
14    375.3    389.0        0.226
15    375.3    375.3        0.357
16    375.3    375.3        0.397
17    375.3    375.3        0.438
18    375.3    375.3        0.500



7.5 The Clique Problem 

The maximum clique (MC) problem is a classical combinatorial optimization 
problem with important applications in different fields, such as cluster anal- 
ysis, information retrieval, mobile networks, and computer vision. The MC 
problem is highly intractable and it presents one of the first problems that 
has been proven to be NP-complete. 







Let $G = (V, E)$ be an arbitrary undirected graph, where $V = \{1, 2, \ldots, n\}$ is the vertex set of $G$, and $E \subseteq V \times V$ is the edge set of $G$. The total number of vertices (here $n$) is called the order of $G$; the total number of edges is called the size of $G$ [29]. The symmetric matrix $A_G = (a_{ij})$, where $a_{ij} = 1$ if $(i, j) \in E$ is an edge of $G$, and $a_{ij} = 0$ if $(i, j) \notin E$, is called the adjacency matrix of $G$. A graph $G = (V, E)$ is said to be complete if all its vertices are pairwise adjacent, that is, for all $i, j \in V$ with $i \ne j$ we have $(i, j) \in E$. For any subset $C$ of $V$ the graph that contains all edges of $G$ that join two vertices of $C$ is called the subgraph of $G$ spanned by $C$; notation $G(C)$. A clique $C$ is a subset of $V$ such that $G(C)$ is complete. The clique number of $G$, denoted by $\omega(G)$, is the order (the number of vertices) of the maximum clique. The MC problem asks for cliques of maximum order $|C|$:

\[
\omega(G) = \max \{ |C| : C \text{ is a clique in } G \}.
\]

An example is given in Figure 7.4. 




Fig. 7.4. A graph with maximum clique {1,2, 3,4}. 



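It is readily seen that the maximum clique is $\{1, 2, 3, 4\}$, which has clique value (cardinality, order) 4. Hence, the clique number of the graph is 4. Note that the corresponding $(6 \times 6)$ adjacency matrix $A = (a_{ij})$ is given by

\[
A = \begin{pmatrix}
0 & 1 & 1 & 1 & 1 & 0 \\
1 & 0 & 1 & 1 & 1 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0
\end{pmatrix}. \tag{7.12}
\]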



We should distinguish a maximum clique from a maximal clique. A maximal 
clique is a clique that is not a proper subset of any other clique. A maximum 
clique is a maximal clique that has maximum order. 







We shall represent a clique $\{x_1, \ldots, x_r\}$ via a “clique vector” $x = (x_1, x_2, \ldots, x_r)$, where $x_i \in \{1, \ldots, n\}$ for $i = 1, \ldots, r \le n$, and $x_1 \ne x_2 \ne \cdots \ne x_r$. Let $\mathscr{X}$ denote the space of all such vectors. Examples of clique vectors $x \in \mathscr{X}$ for the graph in Figure 7.4 are (1), (5), (1, 2), (3, 2, 1), and (3, 2, 4, 1), but not (1, 2, 3, 4, 5), (2, 6), and (1, 5, 2, 4). For each vector $x = (x_1, \ldots, x_r)$ the corresponding clique value (cardinality of the clique) is given by

\[
S((x_1, \ldots, x_r)) = r.
\]

With this notation, the maximum clique problem is of the form (4.2). 

Remark 7.5 (Related problems). The MC problem is related to various other 
problems in graph theory. In particular, it is closely associated with the max- 
imum independent set and the minimum vertex cover problems. 

An independent set of a graph $G = (V, E)$ is a subset of $V$ whose elements are pairwise nonadjacent. The maximum independent set problem is to find an independent set in $G$ of maximum order (cardinality). The order of this set is called the stability number of $G$, denoted by $\sigma(G)$. For the graph in Figure 7.4 we have $\sigma(G) = 3$, where the maximum independent set is $\{4, 5, 6\}$.

A vertex cover of a graph $G = (V, E)$ is a subset of $V$ such that every edge $(i, j) \in E$ has at least one endpoint $i$ or $j$ in the subset. The minimum vertex cover problem is to find a vertex cover of minimum order. For example, in the graph in Figure 7.4 a possible minimum vertex cover is $\{1, 2, 3\}$.

The relation with the maximum clique problem is the following: Let $\bar{G} = (V, \bar{E})$ be the complement graph of $G = (V, E)$, that is, the graph with edge set $\bar{E} = \{(i, j) : i, j \in V,\ i \ne j \text{ and } (i, j) \notin E\}$; see for example Figure 7.5. Then, $C$ is a clique of $G$ if and only if $C$ is an independent set of $\bar{G}$, and if and only if $V \setminus C$ is a vertex cover of $\bar{G}$. Hence $\omega(G) = \sigma(\bar{G})$. In the example $\{1, 2, 3, 4\}$ is a maximum independent set of $\bar{G}$, and $\{5, 6\}$ is a minimum vertex cover of $\bar{G}$. Any result obtained for one of the above problems has its equivalent forms for the other problems.
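
For a graph as small as the one in Figure 7.4 these relations can be verified by brute force. The following Python sketch enumerates all subsets of $V$; the helper names are hypothetical and the approach is only meant as a check on the 6-node example, since exhaustive enumeration is hopeless for graphs of realistic size.

```python
from itertools import combinations

# Adjacency matrix (7.12) of the graph in Figure 7.4 (nodes 1..6).
A = [[0, 1, 1, 1, 1, 0],
     [1, 0, 1, 1, 1, 0],
     [1, 1, 0, 1, 0, 1],
     [1, 1, 1, 0, 0, 0],
     [1, 1, 0, 0, 0, 0],
     [0, 0, 1, 0, 0, 0]]
n = len(A)
V = range(1, n + 1)
# Complement graph: flip every off-diagonal entry.
A_bar = [[1 - A[i][j] if i != j else 0 for j in range(n)] for i in range(n)]


def is_clique(C, adj):
    return all(adj[i - 1][j - 1] == 1 for i, j in combinations(C, 2))


def is_independent(C, adj):
    return all(adj[i - 1][j - 1] == 0 for i, j in combinations(C, 2))


def is_vertex_cover(C, adj):
    return all(i in C or j in C
               for i, j in combinations(V, 2) if adj[i - 1][j - 1] == 1)


def largest(pred, adj):
    """Order of the largest subset of V satisfying pred (brute force)."""
    return max(r for r in range(n + 1)
               for C in combinations(V, r) if pred(C, adj))


clique_number = largest(is_clique, A)
stability_of_complement = largest(is_independent, A_bar)
print(clique_number, stability_of_complement)
```

The same predicates can be used to confirm, subset by subset, that $C$ is a clique of $G$ exactly when it is independent in the complement graph and its complement in $V$ covers all edges of the complement graph.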




Fig. 7.5. The complement graph of the graph in Figure 7.4 (the dashed edges 
correspond to the edges in the original graph). 







Trajectory Generation for Cliques 

To explain the trajectory generation, consider the graph in Figure 7.4. It will be convenient to add an additional node, with label 0, to the graph, which is adjacent to all other nodes. By doing so, we obtain an expanded 7-node network instead of the original 6-node network. It is readily seen that in the CE framework the MC problem is a SEN-type problem, and the trajectory generation is determined by a probability matrix $P$. For example, for our 7-node network, we associate with $A$ the following 7-state probability matrix

\[
P = \begin{pmatrix}
0 & p_{01} & p_{02} & p_{03} & p_{04} & p_{05} & p_{06} \\
0 & 0 & p_{12} & p_{13} & p_{14} & p_{15} & 0 \\
0 & p_{21} & 0 & p_{23} & p_{24} & p_{25} & 0 \\
0 & p_{31} & p_{32} & 0 & p_{34} & 0 & p_{36} \\
0 & p_{41} & p_{42} & p_{43} & 0 & 0 & 0 \\
0 & p_{51} & p_{52} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & p_{63} & 0 & 0 & 0
\end{pmatrix}. \tag{7.13}
\]

For a given $P = (p_{ij})$ we generate the clique vector $X = (X_1, \ldots, X_r)$ in the following way:

Algorithm 7.5.1 (Naive Search)

1. Generate $X_1$ according to the distribution formed by the 0-th row of $P$, that is, $(p_{01}, \ldots, p_{0n})$. Let $k = 1$.

2. Generate $X_{k+1}$ according to the distribution formed by the $X_k$-th row of $P$.

3. Proceed with the trajectory generation through the nodes (states) $X_1, X_2, \ldots, X_r$ until some node, say node $X_{r+1}$, is reached, which does not belong to the clique.

4. Deliver the clique value $S(X) = r$.
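
The steps above can be sketched in Python with NumPy as follows. This is an illustrative reading, not the authors' implementation: the function name `naive_clique_search` is hypothetical, a candidate is accepted only if it is adjacent to every node already in the trajectory, and already visited nodes are given probability zero with the row renormalized, which is one of several ways to rule out revisits.

```python
import numpy as np


def naive_clique_search(P, A, rng):
    """Generate one clique vector along the lines of the naive search.

    P   : (n+1) x (n+1) transition matrix; row/column 0 is the extra node
    A   : n x n 0/1 adjacency matrix of the graph (nodes 1..n)
    rng : numpy random Generator
    Returns the list of nodes X_1, ..., X_r; the clique value is its length.
    """
    n = A.shape[0]
    clique, current = [], 0
    while True:
        probs = P[current, 1:].astype(float)  # row of the current node
        for v in clique:                      # assumption: no revisits
            probs[v - 1] = 0.0
        total = probs.sum()
        if total <= 0.0:                      # no admissible move left
            break
        candidate = int(rng.choice(n, p=probs / total)) + 1
        if all(A[candidate - 1, v - 1] == 1 for v in clique):
            clique.append(candidate)
            current = candidate
        else:                                 # node outside the clique: stop
            break
    return clique


# Graph of Figure 7.4 with adjacency matrix (7.12).
A = np.array([[0, 1, 1, 1, 1, 0],
              [1, 0, 1, 1, 1, 0],
              [1, 1, 0, 1, 0, 1],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0]])

# Initial matrix: uniform first row, uniform over neighbours elsewhere.
P0 = np.zeros((7, 7))
P0[0, 1:] = 1.0 / 6.0
P0[1:, 1:] = A / A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
samples = [naive_clique_search(P0, A, rng) for _ in range(20)]
print(max(len(s) for s in samples))
```

Every sampled vector is a clique by construction, so its length can be used directly as the performance in the CE update.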

Remark 7.6. In contrast to the TSP we do not require that all off-diagonal entries of $P$ be nonzero. Indeed, we always assume that $p_{ij} = 0$ if $a_{ij} = 0$. The total number of nodes connected with a node $k$ is called the degree of the node, written $\delta(k)$. Note that the maximum clique value associated with a node $k$ is always less than or equal to its degree.

Remark 7.7. While proceeding from state $X_k$ to state $X_{k+1}$ in Step 3 of Algorithm 7.5.1 one can either set the probability $p_{X_{k+1}, X_k}$ to zero and then renormalize the remaining probabilities in the row $X_{k+1}$ of $P$, or reject the outcome $X_k$ when generating from the $X_{k+1}$-st row of $P$. This avoids duplication, that is, the possibility that $X_{k+2} = X_k$.

Remark 7.8. If during the course of Algorithm 7.5.1 one obtains a clique value 
k, then one should set to zero all columns of P associated with nodes that 
have degree less than or equal to k, because these nodes can never be part 







of the current clique. If, for example, in our 6-node network $X_1 = 1$, then one can immediately eliminate node 6 (with $\delta(6) = 1$) from the generation algorithm by setting the 6th column of $P$ to 0 and normalizing the rows. If subsequently $X_2 = 3$, we may repeat the same elimination procedure for node 5, because $\delta(5) = 2$.

Let $P_t = (p_{t,ij})$ denote the transition matrix at iteration $t$ of the main CE Algorithm 4.2.1. As usual, we start from $p_{0,0i} = 1/n$, $i = 1, \ldots, n$, and we choose the nonzero elements in each of the other rows equal, that is, $1/\delta(k)$ for each row $k$. For the $p_{t,ij}$ we use the same updating rules (4.37) as for the TSP (with $\le$ replaced with $\ge$). Note that even if there is a unique optimal clique, there will be various possible ODTMs. For example, two possible ODTMs for the optimal clique $\{1, 2, 3, 4\}$ are

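\[
P^{*(1)} = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & p_{12} = 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & p_{23} = 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & p_{34} = 1 & 0 & 0 \\
0 & p_{41} = 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix} \tag{7.14}
\]

and

\[
P^{*(2)} = \begin{pmatrix}
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & p_{14} = 1 & 0 & 0 \\
0 & p_{21} = 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & p_{32} = 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & p_{43} = 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}. \tag{7.15}
\]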

Note that in (7.14) and (7.15) states 5 and 6 are redundant. 

Next, we shall enhance Algorithm 7.5.1. This modification will lead to low-bias clique generation and is called the deep search algorithm to distinguish it from Algorithm 7.5.1, which is called the naive search algorithm. In the deep search algorithm we consider the clique value $S(X) = r$ obtained by Algorithm 7.5.1 only as a possible (worst) alternative.

Algorithm 7.5.2 (Deep Search)

1-3. As in Steps 1-3 of Algorithm 7.5.1.

4*. Return to the previous node $X_r$ and try from node $X_r$ again (randomly or deterministically) until either a larger clique is found or not. In the latter case stop and go to Step 5* below. In the former case continue with the trajectory $X_1, X_2, \ldots, X_r$ (generated at Step 3 of Algorithm 7.5.1), proceeding through the nodes $X_{r+1}, X_{r+2}, \ldots$ until a node is reached that does not belong to the clique passing through the preceding nodes. Proceed with the loop based on Step 3 and Step 4*, that is, return to Step 3 of Algorithm 7.5.1 and proceed from the last node of the enlarged clique again until either a larger clique is found or not. In the latter case, stop and go to Step 5* below.

5*. Deliver the final clique value obtained after $b$ loops, where $b$ corresponds to the total number of successful loops through Step 3 and Step 4* until stopping.

Remark 7.9 (Even Deeper Search). Let $\mathcal{E}$ denote the set of the $q$ largest (elite) clique values among the sample performances (clique values) $S_{t,1}, \ldots, S_{t,N}$ obtained at the $t$-th iteration of Algorithm 7.5.2. One can try to enlarge the cliques with values in $\mathcal{E}$ by arguing as follows: Choose any node contained in a particular clique with a clique value in $\mathcal{E}$, label that node as $X_r$ and proceed directly to Step 4* of Algorithm 7.5.2. If, for example, in our 6-node network the clique value 3 belongs to the elite set $\mathcal{E}$, and is associated, say, with the nodes 1, 2, 3, then we could select (randomly or deterministically) any node from the collection $\{1, 2, 3\}$ and proceed directly to Step 4* of Algorithm 7.5.2, hopefully obtaining a larger (the largest) clique value 4 associated with the nodes 1, 2, 3, 4. Again, such a policy of selecting a node (nodes) from the collection $\{1, 2, 3\}$ can be applied to all three nodes 1, 2, 3 in turn.

Remark 7.10. Let $r$ be a clique value associated with the starting node $k$ and let $\delta(k_1) \le \cdots \le \delta(k_r)$ be the degrees of the nodes in the clique, arranged in increasing order. Clearly, if $r = \delta(k_1)$, then Algorithm 7.5.2 will be unable to enlarge the clique value $r$ starting at any node associated with $r$. It also follows that if $r < \delta(k_1)$ then in order to attempt to enlarge $r$ it is more efficient (computationally) to proceed with Step 4* of Algorithm 7.5.2 from the node associated with the smallest row cardinality $\delta(k_1)$.

Remark 7.11 (Reducing Randomness). In order to avoid additional randomness and to speed up the algorithm, one can proceed directly to Step 2 of Algorithm 7.5.2 deterministically by ignoring Step 1. Indeed, for a given sample size $N$ and given vector $(p_{t,0k},\ k = 1, \ldots, n)$, generate for each $k$ a total of $N p_{t,0k}$ trajectories (in fact, the integer part $\lfloor N p_{t,0k} \rfloor$) starting at node $k$, $k = 1, \ldots, n$. In our 6-node example, let $N = 10$ and let $(p_{t,0k},\ k = 1, \ldots, 6) = (1/5, 1/5, 1/5, 1/5, 1/10, 1/10)$. It follows that in this case $(N p_{t,0k},\ k = 1, \ldots, 6) = (2, 2, 2, 2, 1, 1)$. This procedure is called stratification.
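
The allocation of trajectories to starting nodes is a one-line computation; the short Python sketch below reproduces the 6-node example. The function name is hypothetical, and exact fractions are used here only to keep the integer parts free of floating-point rounding.

```python
import math
from fractions import Fraction


def stratified_counts(N, p0):
    """Number of trajectories started at each node: integer part of N * p."""
    return [math.floor(N * p) for p in p0]


# First row of the transition matrix in the 6-node example.
p0 = [Fraction(1, 5)] * 4 + [Fraction(1, 10)] * 2
counts = stratified_counts(10, p0)
print(counts)
```

Because of the truncation the counts may add up to fewer than $N$ trajectories when the products $N p_{t,0k}$ are not integers; how the remainder is handled is left open here.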

Remark 7.12 (Faster Trajectory Generation). Step 2 of the CE algorithm requires generating a random variable from a discrete distribution $p = (p_1, \ldots, p_n)$, which can be quite time consuming. Similar to Algorithm 4.11.4 we can speed up the trajectory generation by dividing the $p_i$ into two groups (in Algorithm 4.11.4 there are three): one group consisting of the $c$ highest probabilities, with $c$ typically between 5 and 10, and the other group consisting of the rest. We apply the inverse-transform method to draw from the







first group, while the elements in the second group are drawn from a discrete 
uniform distribution. 

We ran Algorithm 4.2.1 with $N = 3 n^2$ for different case studies given in the URL

ftp://dimacs.rutgers.edu/pub/challenge/graph/benchmarks/clique

(needs to be accessed via anonymous ftp under Unix). Problems brock200_1, brock200_2, brock200_3, brock200_4 and brock400_4 were attempted and in all cases we obtained the relative experimental error $\varepsilon = 0$, that is, Algorithm 4.2.1 performed perfectly. All runs were made on a 600 MHz PIII computer with sample size $N = 3 n^2$ and smoothing factor $\alpha = 0.7$. The algorithm stopped if the best sampled solution did not change for 5 consecutive iterations. The network sizes were up to $n = 400$ nodes and the maximum clique value was up to 33 (for brock400_4). A typical evolution (for brock200_1), using the naive search, is given in Table 7.12. The following optimal clique (with size 21) was found: {4, 26, 32, 41, 46, 48, 83, 100, 103, 104, 107, 120, 122, 132, 137, 138, 144, 175, 180, 191, 199}. A similar result for brock400_4 is given in Table 7.13. In five repetitions the exact optimal clique {7, 8, 17, 19, 112, 135, 147, 154, 157, 161, 186, 197, 202, 211, 241, 242, 245, 247, 266, 267, 270, 294, 324, 334, 340, 343, 353, 362, 380, 389, 393, 394, 396} was always found.



Table 7.12. Typical evolution of the clique CE algorithm.

t            1    2    3    4    5    6    7    8    9    10
$\gamma_t$   17   18   19   20   20   20   20   20   20   21



Table 7.13. Typical evolution of the CE algorithm for brock400_4.

t            1    2    3    4    5    6
$\gamma_t$   21   21   22   33   33   33



We also ran the improved trajectory generation algorithm of Remark 7.12 
and compared it with the standard version. For the improved version we set 
the number of first class elements to c = 5. The improved version of the 
algorithm was as accurate as the naive version, while being approximately 
twice as fast. Table 7.14 shows the optimal solution (size of the maximum 
clique) and the execution time range for each instance. The execution time
range was obtained by executing each algorithm on each instance five
times.







Table 7.14. Optimal solutions and typical execution times for the examined instances.

              Maximum     Execution time (seconds)
              clique      Naive            Improved
Instance      size        Min     Max      Min     Max
brock200_1    21           314     375      189     227
brock200_2    12           135     144       91      92
brock200_3    15           189     197      121     123
brock200_4    17           235     249      143     146
brock400_4    33          2874    3414     1857    2698



7.6 Exercises 

1. Clique problem. Apply the CE Algorithm 4.2.1 to the clique problem 
of Section 7.5. 

a) First run the algorithm on a synthetic problem, using the naive clique 
generation Algorithm 7.5.1. 

b) When the synthetic problem is working, run the algorithm on a prob- 
lem from the clique benchmark cases from the URL in Section 7.5, 
again using the naive clique generation. 

c) To speed up the trajectory generation apply the procedure outlined 
in Remark 7.12. 

d) Finally, run the FACE Algorithm 5.3.1. 

2. CE for Independent Set and Vertex Cover. Follow the instructions 
for the clique problem, design your own trajectory generation algorithm 
for the independent set and vertex cover problems, and run Algorithm 
4.2.1 first on a synthetic problem and then on a benchmark problem from 
the Web. Introduce noise to your problem by adding $\mathsf{U}(-a, a)$ noise to the
objective function, run the noisy version of Algorithm 4.2.1 and present
2-3 tables (for different values of $a$) similar to Table 6.14.

3. CE for Vertex Coloring and Node Coloring. Similar to the indepen- 
dent set and vertex cover problem, design your own trajectory generation 
algorithm for vertex coloring and node coloring problems and run Algo- 
rithm 4.2.1 first on a synthetic problem and then on a benchmark problem 
from the Web. 

4. CE for Permutation Flow Shop Problem (PFSP). In the permutation flow shop sequencing problem $n$ jobs have to be processed (in the same order) on $m$ machines. The objective is to find the permutation of jobs that will minimize the makespan, that is, the time at which the last job is completed on machine $m$. Let $t(i, j)$ be the processing time for job $i$ on machine $j$ and let $x = (x_1, x_2, \ldots, x_n)$ be a job permutation. Then the completion time $C(x_i, j)$ for job $x_i$ on machine $j$ can be calculated as follows:

\begin{align*}
C(x_1, 1) &= t(x_1, 1), \\
C(x_i, 1) &= C(x_{i-1}, 1) + t(x_i, 1), \quad i = 2, \ldots, n, \\
C(x_1, j) &= C(x_1, j - 1) + t(x_1, j), \quad j = 2, \ldots, m, \\
C(x_i, j) &= \max\{C(x_{i-1}, j),\, C(x_i, j - 1)\} + t(x_i, j), \quad i = 2, \ldots, n;\ j = 2, \ldots, m.
\end{align*}

The objective is to minimize $S(x) = C(x_n, m)$. The trajectory generation for the PFSP is exactly the same as in the TSP.

a) Follow the instructions at the beginning of Appendix A and run Al- 
gorithm 4.2.1 first for a synthetic problem and then for a benchmark 
problem from the Internet. 

b) To speed up the trajectory generation apply the procedure given in 
the Appendix of Chapter 4. Add noise to your problem and present a 
table similar to Table 6.14. 
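
The completion-time recursion of the PFSP translates into a short loop that keeps, for each machine, the completion time of the most recently scheduled job. The Python sketch below is only a helper for evaluating $S(x)$, not a solution to the exercise; the function name, the 0-based job indices, and the tiny processing-time table are assumptions made for illustration.

```python
def makespan(x, t):
    """Makespan of the job permutation x in a permutation flow shop.

    x : sequence of job indices (0-based here, unlike the 1-based text)
    t : t[i][j] is the processing time of job i on machine j
    """
    m = len(t[0])
    C = [0] * m  # completion times of the previous job on each machine
    for job in x:
        C[0] += t[job][0]  # machine 1: jobs simply follow one another
        for j in range(1, m):
            # wait for the previous job on machine j and this job on j-1
            C[j] = max(C[j], C[j - 1]) + t[job][j]
    return C[-1]


# Hypothetical instance with 3 jobs and 2 machines.
t = [[3, 2],
     [1, 4],
     [2, 1]]
print(makespan((0, 1, 2), t), makespan((1, 0, 2), t))
```

Since only the previous job's completion times are needed, the evaluation costs on the order of $n m$ operations per permutation, which matters when $N$ permutations are scored in every CE iteration.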

5. CE Versus the EM Algorithm. Suppose we have some statistical data $y_1, \ldots, y_n$ and we wish to fit the data to a mixture of two pdfs. Specifically, we assume that $y_1, \ldots, y_n$ are the outcomes of i.i.d. random variables $Y_1, \ldots, Y_n$ with density

\[
f(y; a, \theta^0, \theta^1) = (1 - a) f_0(y; \theta^0) + a f_1(y; \theta^1), \tag{7.16}
\]

for some unknown $a$, $\theta^0$ and $\theta^1$. For example

\[
f_i(y; \theta^i) = \frac{1}{\sigma_i \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{y - \mu_i}{\sigma_i} \right)^2}, \quad i = 0, 1, \tag{7.17}
\]

with $\theta^i = (\mu_i, \sigma_i^2)$, $i = 0, 1$. A straightforward way of estimating the parameters from the data $y = (y_1, \ldots, y_n)$ is to choose the estimates such that the likelihood function

\[
\mathcal{L}(a, \theta^0, \theta^1; y) := \prod_{i=1}^{n} f(y_i; a, \theta^0, \theta^1) \tag{7.18}
\]

is maximized. However, finding these maximum likelihood estimates is in general not easy for mixture models, since the likelihood function $\mathcal{L}(a, \theta^0, \theta^1; y)$ is typically multi-extremal.

In the well-known EM method [115], the likelihood is optimized via an 
iterative procedure. Specifically, consider the mixture model (7.16). We 







may generate the random variables $Y_i$ via a two-step procedure: first draw a random variable $X_i \sim \mathsf{Ber}(a)$ and then draw $Y_i$ from $f_{X_i}$. Using this point of view, we can interpret the data $y_1, \ldots, y_n$ as only a part of the true data. The values of the 0-1 variables $x_1, \ldots, x_n$, which indicate whether the $y_i$'s were drawn from $f_0$ or $f_1$, are hidden. Now, if $x = (x_1, \ldots, x_n)$ were known, then the MLEs of the parameters could be easily computed, namely as

\[
\hat{a} = n^{-1} \sum_{i=1}^{n} x_i, \tag{7.19}
\]
\[
\hat{\mu}_j = \frac{\sum_{i: x_i = j} y_i}{\sum_{i: x_i = j} 1}, \tag{7.20}
\]
\[
\hat{\sigma}_j^2 = \frac{\sum_{i: x_i = j} (y_i - \hat{\mu}_j)^2}{\sum_{i: x_i = j} 1}, \quad j = 0, 1. \tag{7.21}
\]

On the other hand, if the parameters $a$, $\theta^0$ and $\theta^1$ were known, then estimating the distribution of $X_i$ from the data would be easy, using Bayes's formula:

\[
p_i = \mathbb{P}(X_i = 1 \mid Y_i = y_i) = \frac{a\, f_1(y_i; \theta^1)}{f(y_i; a, \theta^0, \theta^1)}. \tag{7.22}
\]

Since we do not know the $X_i$ we cannot use formulas (7.19)-(7.21). However, instead of maximizing the logarithm of the likelihood function in (7.18), we can maximize the expected log-likelihood function

\[
\mathbb{E} \ln \mathcal{L}(a, \theta^0, \theta^1; X, y) = \mathbb{E} \ln \Big[ \prod_{i: X_i = 0} (1 - a) f_0(y_i; \theta^0) \times \prod_{i: X_i = 1} a f_1(y_i; \theta^1) \Big], \tag{7.23}
\]

where $X \sim \mathsf{Ber}(p)$, with $p = (p_1, \ldots, p_n)$ as defined above. This leads to the following algorithm:



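Algorithm 7.6.1 (EM algorithm)

1. Choose initial estimates (guesses) $a_0$ and $\theta_0^i$, $i = 0, 1$. Set $t = 1$.

2. (E-step) For each $i$ define

\[
p_{t,i} = \frac{a_{t-1}\, f_1(y_i; \theta_{t-1}^1)}{f(y_i; a_{t-1}, \theta_{t-1}^0, \theta_{t-1}^1)},
\]

and determine the expected log-likelihood function in (7.23), with $X \sim \mathsf{Ber}(p_t)$.

3. (M-step) Maximize, with respect to $a$, $\theta^0$ and $\theta^1$, the expected log-likelihood function obtained in the E-step. Call the maximizing parameters $a_t$, $\theta_t^0$ and $\theta_t^1$. Specifically, for the case (7.17) it can be shown that

\[
a_t = n^{-1} \sum_{i=1}^{n} p_{t,i}, \tag{7.24}
\]
\[
\mu_{t,0} = \frac{\sum_{i=1}^{n} (1 - p_{t,i})\, y_i}{\sum_{i=1}^{n} (1 - p_{t,i})}, \tag{7.25}
\]
\[
\mu_{t,1} = \frac{\sum_{i=1}^{n} p_{t,i}\, y_i}{\sum_{i=1}^{n} p_{t,i}}, \tag{7.26}
\]
\[
\sigma_{t,0}^2 = \frac{\sum_{i=1}^{n} (1 - p_{t,i}) (y_i - \mu_{t,0})^2}{\sum_{i=1}^{n} (1 - p_{t,i})}, \tag{7.27}
\]
\[
\sigma_{t,1}^2 = \frac{\sum_{i=1}^{n} p_{t,i} (y_i - \mu_{t,1})^2}{\sum_{i=1}^{n} p_{t,i}}. \tag{7.28}
\]

4. If some stopping criterion is met, then stop; otherwise set $t := t + 1$ and reiterate from Step 2.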



Note that EM is a local search procedure and therefore there is no guarantee that it converges to the global maximum. As an alternative to EM consider the CE Algorithm. There are two different approaches. First, we can view the maximization of (7.18) as a continuous multi-extremal optimization with respect to $a$ and the $\theta^j$; see Section 5.1. Alternatively, we can seek to maximize the function

\[
S(x) = \prod_{i: x_i = 0} (1 - a) f_0(y_i; \theta^0) \times \prod_{i: x_i = 1} a f_1(y_i; \theta^1) \tag{7.29}
\]

over all $x$. Note that in this case, we have to estimate $S$ as we go along, since we do not know the true parameters. We can do this by using estimates for the parameters $a$ and $\theta^j$, $j = 0, 1$.

a) Present an explicit CE algorithm for the maximization of (7.18) as a function of $a$ and the $\theta^j$.

b) Present an explicit CE algorithm for the maximization of (7.29) as a function of $x$.

c) Run the EM algorithm and both versions of the above CE algorithms, and compare their efficiencies on a mixture of 10 Normal pdfs with $\mu_j = j$ and $\sigma_j^2 = j$, $j = 1, 2, \ldots, 10$.
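
As a starting point for the comparison, the E-step (7.22) and the M-step (7.24)-(7.28) of Algorithm 7.6.1 for two Normal components can be written in a few lines of NumPy. The sketch below is a plain illustration and makes several assumptions that are not in the text: a fixed number of iterations replaces the stopping criterion, the function names are hypothetical, and the synthetic data (components N(0, 1) and N(5, 1) with mixing weight 0.6) are chosen only to exercise the code.

```python
import numpy as np


def normal_pdf(y, mu, var):
    """Density of N(mu, var) evaluated at y."""
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def em_two_normals(y, a, mu0, var0, mu1, var1, iters=200):
    """EM iterations for the two-component Normal mixture (7.16)-(7.17)."""
    y = np.asarray(y, dtype=float)
    for _ in range(iters):
        # E-step: posterior probability that y_i was drawn from f_1
        w1 = a * normal_pdf(y, mu1, var1)
        w0 = (1.0 - a) * normal_pdf(y, mu0, var0)
        p = w1 / (w0 + w1)
        # M-step: weighted averages, cf. (7.24)-(7.28)
        a = p.mean()
        mu0 = np.sum((1.0 - p) * y) / np.sum(1.0 - p)
        mu1 = np.sum(p * y) / np.sum(p)
        var0 = np.sum((1.0 - p) * (y - mu0) ** 2) / np.sum(1.0 - p)
        var1 = np.sum(p * (y - mu1) ** 2) / np.sum(p)
    return a, mu0, var0, mu1, var1


# Synthetic data: X_i ~ Ber(0.6), Y_i ~ N(5, 1) if X_i = 1, else N(0, 1).
rng = np.random.default_rng(1)
x = rng.random(1000) < 0.6
y = np.where(x, rng.normal(5.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000))

a_hat, mu0_hat, var0_hat, mu1_hat, var1_hat = em_two_normals(
    y, a=0.5, mu0=y.min(), var0=1.0, mu1=y.max(), var1=1.0)
print(a_hat, mu0_hat, var0_hat, mu1_hat, var1_hat)
```

For well-separated components such as these, the iteration settles quickly; with overlapping components or poor starting values the local nature of EM noted above becomes visible.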




8 



Applications of CE to Machine Learning 



In this chapter we apply the CE method to several problems arising in machine 
learning, specifically with respect to optimization. In Section 8.1, adapted 
from [50], we apply CE to the well-known mastermind game. Section 8.2, 
based partly on [112], describes the application of the CE method to Markov 
decision processes. Finally, in Section 8.3 the CE method is applied to clus- 
tering problems. In addition to its simplicity, the advantage of using the CE 
method for machine learning is that it does not require direct estimation of 
the gradients, as many other algorithms do (for example, the stochastic ap- 
proximation, steepest ascent, or conjugate gradient method). Moreover, as a 
global optimization procedure the CE method is quite robust with respect to 
starting conditions and sampling errors, in contrast to some other heuristics, 
such as simulated annealing or guided local search. 



8.1 Mastermind Game 

In the well-known mastermind game the objective is to decipher a hidden “code” of colored pegs through a series of guesses. After each guess new information on the true code is provided in the form of black and white pegs. A black peg is earned for each colored peg that is in exactly the right place, and a white peg for each peg that is in the solution, but in the wrong position; see Figure 8.1. To get a feel for the game, one can visit

http://www.javaonthebrain.com/java/mastermind

and play its Java implementation.

Consider a mastermind game with $m$ colors, numbered $\{1, \ldots, m\}$, and $n$ pegs (positions). The hidden solution and a guess can be represented by a row vector of length $n$ with numbers in $\{1, \ldots, m\}$. For example, for $n = 5$ and $m = 7$ the solution could be $y = (4, 2, 4, 7, 3)$, and a possible guess $x = (4, 3, 4, 2, 5)$. Let $\mathscr{X}$ be the space of all possible guesses. Note that the



252 8 Applications of CE to Machine Learning 




Code 



Feedback 

• • o o ■ 



• o o ■ • 



Fig. 8.1. The mastermind game. 



total number of possibilities is m^. On A' we define a performance function S 
which returns for each guess x the “pegscore” 

*9(x) = 2 X A^BlackPegs + -^WhitePegs , (8.1) 

where A^BiackPegs and A^vhitePegs are the number of black and white pegs 
returned after the guess. We have assumed, somewhat arbitrarily, that a black 
peg is worth twice as much as a white peg. As an example, for the solution 
y = (4, 2, 4, 7, 3) and guess x = (4, 3, 4, 2, 5) above, one gets 2 black pegs (for 
the first and third pegs in the guess) and 2 white pegs (for the second and 
fourth pegs in the guess, which match the fifth and second pegs of the solution 
but are in the wrong place). Hence the score for this guess is 6. 
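The pegscore (8.1) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `pegscore` and the multiset-intersection way of counting white pegs are our own choices and are not taken from the text.

```python
from collections import Counter

def pegscore(solution, guess):
    """Return (black, white, score) with score = 2*black + white, as in (8.1)."""
    if len(solution) != len(guess):
        raise ValueError("solution and guess must have the same length")
    # Black pegs: right color in exactly the right position.
    black = sum(s == g for s, g in zip(solution, guess))
    # Colors shared by both vectors (with multiplicity) include the black
    # pegs; whatever remains is right color, wrong position.
    common = sum((Counter(solution) & Counter(guess)).values())
    white = common - black
    return black, white, 2 * black + white

result = pegscore((4, 2, 4, 7, 3), (4, 3, 4, 2, 5))  # the example in the text
```

For the example above, `result` is `(2, 2, 6)`, matching the score of 6 computed in the text.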

There are some specially designed algorithms that efficiently solve the
problem for different numbers of colors and pegs. For examples see

http://www.mathworks.com/contest/mastermind.cgi/home.html

With the above performance function, the problem can be formulated as an
optimization problem of the form (4.2), and hence we can apply the CE
method to solve it. In order to apply the CE method we first need to generate
random guesses $\mathbf{X} \in \mathcal{X}$. We can do this via an n x m
matrix P. Each element $p_{ij}$ of P describes the probability that we choose
the j-th color for the i-th peg (location). Since only one color may be
assigned to each peg, we have that $\sum_{j=1}^{m} p_{ij} = 1$ for all i. In
each iteration t of the CE algorithm we independently sample for each row
(that is, each peg) a color using the probability matrix
$P = \widehat{P}_t$ and calculate the score S according to (8.1). The updating
of the elements $\hat{p}_{t,ij}$ of the probability matrix $\widehat{P}_t$ is
performed exactly as in (4.14). Note that in this case the numerator in (4.14)
simply counts the number of times the color index j has been assigned to peg i
for the elite samples.
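A minimal Python sketch of this sampling and updating scheme follows. It is not the code used for the experiments below: the function names, the 0-based color indices, the small default sample size, and the smoothed update with parameter `alpha` are assumptions made for illustration. The elite-frequency count in `update_matrix` is meant to mirror the description of (4.14) given above.

```python
import random
from collections import Counter

def pegscore(solution, guess):
    black = sum(s == g for s, g in zip(solution, guess))
    common = sum((Counter(solution) & Counter(guess)).values())
    return 2 * black + (common - black)

def sample_guess(P, rng):
    # One color index (0-based) per peg, drawn from the matching row of P.
    return tuple(rng.choices(range(len(row)), weights=row)[0] for row in P)

def update_matrix(P, guesses, scores, rho, alpha):
    """Elite-frequency update of P, followed by smoothing with alpha."""
    n, m = len(P), len(P[0])
    n_elite = max(1, int(round(rho * len(guesses))))
    gamma = sorted(scores, reverse=True)[n_elite - 1]      # elite threshold
    elite = [g for g, s in zip(guesses, scores) if s >= gamma]
    new_P = []
    for i in range(n):
        counts = Counter(g[i] for g in elite)              # color j at peg i
        freq = [counts[j] / len(elite) for j in range(m)]
        new_P.append([alpha * f + (1 - alpha) * p for f, p in zip(freq, P[i])])
    return new_P, gamma

def ce_mastermind(solution, m, N=200, rho=0.1, alpha=0.7, max_iter=100, seed=0):
    rng = random.Random(seed)
    n = len(solution)
    P = [[1.0 / m] * m for _ in range(n)]                  # uniform start
    best = None
    for _ in range(max_iter):
        guesses = [sample_guess(P, rng) for _ in range(N)]
        scores = [pegscore(solution, g) for g in guesses]
        top = max(zip(scores, guesses))
        if best is None or top[0] > best[0]:
            best = top
        if best[0] == 2 * n:                               # all pegs black
            break
        P, _ = update_matrix(P, guesses, scores, rho, alpha)
    return best, P
```

Calling `ce_mastermind((1, 0, 2), m=3)` returns the best (score, guess) pair found together with the final matrix; every row of that matrix still sums to 1 because each update is a convex combination of two probability vectors.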











8.1.1 Numerical Results 

Table 8.1 presents a typical evolution of the CE algorithm for the mastermind
test problem denoted on the site

http://www.mathworks.com/contest/mastermind.cgi/home.html

as “problem (5.2.3),” with the matrix P of size n x m = 36 x 33. We used
N = 5mn = 5940, $\varrho$ = 0.01 and $\alpha$ = 0.7. It took 34 seconds of CPU
time to find the true solution. The notation in the table is the same as in
Section 4.9, with $S_{t,(N_t)}$ abbreviated to $S_t^*$. Table 8.2 presents
data similar to Table 8.1 for the FACE Algorithm 5.3.1. The initial sample
size is 594 and the elite sample size is 297. The true solution was found in
18 seconds.



Table 8.1. Evolution of Algorithm 4.2.1 for the mastermind problem with m = 33
colors and n = 36 pegs.

  t    $S_t^*$    $\hat{\gamma}_t$    $P_t^{\max}$    $P_t^{\min}$
  1      31          26               0.1138          0.0330
  2      34          28               0.2382          0.0308
  3      38          32               0.3574          0.0304
  4      42          35               0.5348          0.0304
  5      48          40               0.6457          0.0304
  6      55          45               0.7756          0.0306
  7      60          51               0.8312          0.0304
  8      64          57               0.9016          0.0309
  9      70          63               0.9426          0.0316
 10      72          68               0.9804          0.8247
 11      72          72               0.9961          0.9649



Table 8.2. Evolution of FACE Algorithm 5.3.1 for the mastermind problem with
m = 33 colors and n = 36 pegs.

  t    $S_t^*$    $\hat{\gamma}_t$    $N_t$    $P_t^{\max}$    $P_t^{\min}$
  1      30          21                594      0.05            0.03
  2      32          24                594      0.10            0.03
  3      33          28               5212      0.17            0.03
  4      35          29                594      0.23            0.03
  5      36          29                594      0.28            0.03
  6      39          31                594      0.30            0.03
  7      42          32                594      0.37            0.03
  8      43          34                594      0.45            0.03
  9      48          36                594      0.52            0.03
 10      49          40               1600      0.62            0.03
 11      51          42                594      0.71            0.03
 12      54          44                594      0.76            0.03
 13      55          46                606      0.79            0.03
 14      59          49                594      0.84            0.03
 15      62          52                594      0.87            0.03
 16      65          54                594      0.91            0.03
 17      66          57                594      0.93            0.03
 18      67          60                730      0.96            0.03
 19      72          63                594      0.98            0.03
 20      72          70              11880      0.99            0.82







Table 8.3 compares the efficiencies (the relative experimental errors and
the CPU times) of CE Algorithm 4.2.1 and FACE Algorithm 5.3.1 for different
mastermind games. The results are averaged over 10 different runs. We see
that the algorithms have similar accuracies, and that the FACE algorithm is
typically about two times faster than its CE counterpart (the ratio of the
CPU times in Table 8.3 ranges from roughly 1.7 to 2.1).



Table 8.3. The efficiency of CE Algorithm 4.2.1 versus the FACE algorithm for
different mastermind games.

  m     n     CE time (sec.)    $\varepsilon$(CE)    FACE time (sec.)    $\varepsilon$(FACE)
 18    36        13.667            0.0000                7.909              0.0221
 23    32        15.497            0.0000                8.720              0.0000
 30    41        41.006            0.0000               20.392              0.0000
 42    41        66.695            0.0000               35.525              0.0000
 43    45        90.788            0.0000               42.469              0.0000



8.2 The Markov Decision Process and Reinforcement 
Learning 

The Markov decision process (MDP) model is standard in machine learning, 
operations research, and related fields. We review briefly some of the basic 
definitions and concepts in MDP. For details see for example [25, 132, 174]. 
This section is based partly on [112]. For another example of the power of the 
CE method in the MDP context see [117]. 

An MDP is defined by a tuple $(\mathcal{Z}, \mathcal{A}, \mathcal{P}, r)$, where

1. $\mathcal{Z} = \{1, \ldots, n\}$ is a finite set of states.

2. $\mathcal{A} = \{1, \ldots, m\}$ is the set of possible actions by the
decision maker. We assume it is the same for every state, to simplify notation.

3. $\mathcal{P}$ is the transition probability matrix with elements
$\mathcal{P}(z' \mid z, a)$ giving the transition probability from state z to
state z' when action a is chosen.

4. r(z, a) is the reward for performing action a in state z (r may be random).

At each time instance k the decision maker observes the current state
$z_k$ and determines the action to be taken, say $a_k$. As a result, a reward
$r(z_k, a_k)$, or $r_k$ for short, is received and a new state z' is chosen
according to $\mathcal{P}(z' \mid z_k, a_k)$. A policy or strategy $\pi$ is a
rule that determines, for each history
$H_k = z_1, a_1, \ldots, z_{k-1}, a_{k-1}, z_k$ of states and actions, the
probability distribution of the decision maker’s actions at time k. A policy
is called Markov if each action is deterministic and depends only on the
current state $z_k$. A Markov policy is called stationary if it does not
depend on the time k. The goal of the decision maker is to maximize a certain
reward function.




The following are standard reward criteria:

1. Finite horizon reward.

This applies when there exists a finite time $\tau$ (random or deterministic)
at which the process terminates. The objective is to maximize the total
reward

$$S(\pi) = \mathbb{E}_{\pi} \sum_{k=0}^{\tau-1} r_k . \tag{8.2}$$

Here $\mathbb{E}_{\pi}$ denotes the expectation with respect to some
probability measure induced by the strategy $\pi$.

2. Infinite horizon discounted reward.

The objective is to find a strategy $\pi$ that maximizes

$$S(\pi) = \mathbb{E}_{\pi} \sum_{k=0}^{\infty} \beta^k r_k , \tag{8.3}$$

where $0 < \beta < 1$ is the discount factor.

3. Average reward.

The objective is to maximize

$$S(\pi) = \liminf_{\tau \to \infty} \frac{1}{\tau} \, \mathbb{E}_{\pi} \sum_{k=0}^{\tau-1} r_k .$$

In the next section we will restrict our attention to the stochastic shortest
path MDP, where it is assumed that the process starts from a specific initial
state $z_0 = z_{\mathrm{start}}$ and terminates in an absorbing state
$z_{\mathrm{fin}}$ with zero reward. For the finite horizon reward criterion,
$\tau$ is the time at which $z_{\mathrm{fin}}$ is reached (which we will
assume will happen eventually). It is well known [132] that for the shortest
path MDP there exists a stationary Markov policy which maximizes $S(\pi)$, for
each of the reward functions above.

If both r and $\mathcal{P}$ are known, then several efficient methods, such as
value iteration and policy iteration, can be used to find the optimal policy
[132, 173, 174]. However, if the transition probability or the reward function
is unknown, the problem is much more difficult, and is referred to as a
learning problem. A well-known framework for learning algorithms is
reinforcement learning (RL), where an agent learns the behavior of the system
through trial and error in an unknown dynamic environment; see [92].

There are several approaches to RL, which can be roughly divided into
the following three classes: model-based, model-free, and policy search. In the
model-based approach, first a model of the environment is constructed. The
estimated MDP is then solved using standard tools of dynamic programming
[96]. In the model-free approach one learns a utility function, instead of
learning the model. The optimal policy is to choose at each state an action
that maximizes the expected utility. The popular Q-learning algorithm [171] is
an example of this approach. In the policy search approach a subspace of the
policy space is searched, and the performance of policies is evaluated based
on their empirical performance [19, 163]. An example of a gradient-based
policy search method is the REINFORCE algorithm [175]. A detailed account of
policy gradient methods can be found in [20]. For an approach which uses a
direct search in policy space see [139]. The CE algorithm in the next section
can be viewed as a policy search approach.

Remark 8.1 (Stochastic Approximation). Many RL algorithms are based on
the classic stochastic approximation (SA) algorithm. To explain SA, assume
that we need to find the unique solution $x^*$ of some nonlinear equation
S(x) = 0, where instead of S(x) only an estimate $\widehat{S}(x)$ is
available, with $\mathbb{E}\widehat{S}(x) = S(x)$. The SA algorithm for
estimating $x^*$ involves the iteration

$$x_{t+1} = x_t + \beta_t \, \widehat{S}(x_t) ,$$

where $\{\beta_t, t = 1, 2, \ldots\}$ is a positive sequence satisfying

$$\sum_{t=1}^{\infty} \beta_t = \infty, \qquad \sum_{t=1}^{\infty} \beta_t^2 < \infty . \tag{8.4}$$

The connection between SA and Q-learning is given in [165]. This work has
made an important impact on the entire field of RL. Unfortunately, SA is
known to converge very slowly, because of (8.4). Even if $\beta_t$ remains
bounded away from 0 (and thus convergence is not guaranteed), it is still
required that $\beta_t$ be small in order to ensure convergence to a
reasonable neighboring solution [30]. We shall employ the CE method instead of
SA and shall demonstrate its high efficiency.
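To make the SA iteration concrete, here is a small Python sketch on a toy problem. The test function S(x) = 2 - x, the unit-variance Gaussian noise, and the step sizes $\beta_t = 1/t$ (which satisfy both conditions in (8.4)) are assumptions chosen for illustration; none of them come from the text.

```python
import random

def stochastic_approximation(noisy_S, x0, n_iter=5000, seed=0):
    """Iterate x_{t+1} = x_t + beta_t * S_hat(x_t) with beta_t = 1/t."""
    rng = random.Random(seed)
    x = x0
    for t in range(1, n_iter + 1):
        beta_t = 1.0 / t        # sum beta_t = inf, sum beta_t^2 < inf
        x = x + beta_t * noisy_S(x, rng)
    return x

def noisy_S(x, rng):
    # Hypothetical example: S(x) = 2 - x, observed with N(0, 1) noise,
    # so the root we are looking for is x* = 2.
    return (2.0 - x) + rng.gauss(0.0, 1.0)

x_hat = stochastic_approximation(noisy_S, x0=10.0)
```

With these step sizes the iterate is simply a running average of noisy observations of the root, so its error shrinks only at the rate $1/\sqrt{t}$, which illustrates the slow convergence mentioned in the remark.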

8.2.1 Policy Learning via the CE Method 

We consider a CE learning algorithm for the shortest path MDP, where it is
assumed that the process starts from a specific initial state
$z_0 = z_{\mathrm{start}}$, and that there is an absorbing state
$z_{\mathrm{fin}}$ with zero reward. The objective is given in (8.2), with
$\tau$ being the stopping time at which $z_{\mathrm{fin}}$ is reached, which
we will assume will always happen.

To put this problem in the CE framework, consider the maximization
problem (8.2). Recall that for the shortest path MDP an optimal stationary
strategy exists. We can represent each stationary strategy as a vector
$\mathbf{x} = (x_1, \ldots, x_n)$ with $x_i \in \{1, \ldots, m\}$ being the
action taken when visiting state i. Writing the expectation in (8.2) as

$$S(\mathbf{x}) = \mathbb{E}_{\mathbf{x}} \sum_{k=0}^{\tau-1} r(Z_k, A_k) , \tag{8.5}$$

where $Z_0, Z_1, \ldots$ are the states visited, and $A_0, A_1, \ldots$ the
actions taken, we see that the optimization problem (8.2) is of the form
(4.2). We shall also consider the case where $S(\mathbf{x})$ is measured
(observed) with some noise, in which case we have a noisy optimization
problem. The idea now is to combine the random policy generation and the
random trajectory generation in the following way: At each stage of the CE
algorithm we generate random policies and random trajectories using an
auxiliary n x m matrix $P = (p_{za})$, such that for each state z we choose
action a with probability $p_{za}$. Once this “policy matrix” P is defined,
each iteration of the CE algorithm comprises the following two standard
phases:

1. Generation of N random trajectories
$(Z_0, A_0, Z_1, A_1, \ldots, Z_\tau, A_\tau)$ using the auxiliary policy
matrix P. The cost of each trajectory is computed via

$$S(\mathbf{X}) = \sum_{k=0}^{\tau-1} r(Z_k, A_k) . \tag{8.6}$$

2. Updating of the parameters of the policy matrix $(p_{za})$ on the basis of
the data collected in the first phase.

The matrix P is typically initialized to a uniform matrix ($p_{za} = 1/m$). We
describe both the trajectory generation and updating procedure in more detail
below. We shall show that in calculating the associated sample performance,
one can take into account the Markovian nature of the problem to speed up
the Monte Carlo process.



Generating Random Trajectories 

Generation of random trajectories for an MDP is straightforward and is given
for convenience only. All one has to do is to start the trajectory from the
initial state $z_0 = z_{\mathrm{start}}$ and follow the trajectory by
generating each action from the policy matrix P and each new state from the
transition probabilities $\mathcal{P}$, until the absorbing state
$z_{\mathrm{fin}}$ is reached at time $\tau$, say.

Algorithm 8.2.1 (Trajectory Generation for MDP)

Input: P, the auxiliary policy matrix.

1. Start from the given initial state $Z_0 = z_{\mathrm{start}}$; set k = 0.

2. Generate an action $A_k$ according to the $Z_k$-th row of P, calculate the
reward $r_k = r(Z_k, A_k)$ and generate a new state $Z_{k+1}$ according to
$\mathcal{P}(\cdot \mid Z_k, A_k)$. Set k = k + 1. Repeat until
$Z_k = z_{\mathrm{fin}}$.

3. Output the total reward of the trajectory
$(Z_0, A_0, Z_1, A_1, \ldots, Z_\tau)$, given by (8.6).
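The trajectory generation steps above can be sketched in Python as follows. The callback interface (`transition`, `reward`), the 0-based state and action indices, and the `max_steps` safety guard are assumptions made for this illustration; the toy chain at the end is likewise hypothetical.

```python
import random

def generate_trajectory(P, transition, reward, z_start, z_fin, rng,
                        max_steps=10_000):
    """Sketch of Algorithm 8.2.1; returns (trajectory, total_reward).

    P[z][a]               -- probability of choosing action a in state z
    transition(z, a, rng) -- samples the next state of the MDP
    reward(z, a, rng)     -- returns the (possibly random) reward r(z, a)
    """
    z, k = z_start, 0                                        # Step 1
    trajectory, total = [], 0.0
    while z != z_fin and k < max_steps:                      # Step 2
        a = rng.choices(range(len(P[z])), weights=P[z])[0]   # row Z_k of P
        r = reward(z, a, rng)
        trajectory.append((z, a, r))
        total += r                                           # sum in (8.6)
        z = transition(z, a, rng)
        k += 1
    return trajectory, total                                 # Step 3

# Hypothetical 3-state chain: action 1 moves right, action 0 stays put.
def step(z, a, rng):
    return z + 1 if a == 1 else z

def cost(z, a, rng):
    return -1.0

P = [[0.0, 1.0], [0.0, 1.0], [0.5, 0.5]]
traj, total = generate_trajectory(P, step, cost, z_start=0, z_fin=2,
                                  rng=random.Random(0))
```

Because the policy matrix in this toy example always chooses action 1, the trajectory visits states 0 and 1 once each and collects a total reward of -2.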







Updating rules 

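Given the N strategies $\mathbf{X}_1, \ldots, \mathbf{X}_N$ and their scores
$S(\mathbf{X}_1), \ldots, S(\mathbf{X}_N)$, one can update the policy matrix
$(p_{za})$ using the CE method, namely as per

$$\hat{p}_{t,za} = \frac{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \geq \hat{\gamma}_t\}} \, I_{\{\mathbf{X}_k \in \mathcal{X}_{za}\}}}{\sum_{k=1}^{N} I_{\{S(\mathbf{X}_k) \geq \hat{\gamma}_t\}} \, I_{\{\mathbf{X}_k \in \mathcal{X}_{z}\}}} , \tag{8.7}$$

where $\{\mathbf{X}_k \in \mathcal{X}_{z}\}$ is the event that the trajectory
generated by strategy $\mathbf{X}_k$ contains a visit to state z, and
$\{\mathbf{X}_k \in \mathcal{X}_{za}\}$ is the event that this trajectory
contains a visit to state z in which action a was taken.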

We now explain how to take advantage of the Markovian nature of the 
problem. Let us think of a maze where a certain trajectory starts badly, that is, 
the path is not efficient in the beginning, but after some time it starts moving 
quickly towards the goal. According to (8.7), all the updates are performed 
in a similar manner in every state in the trajectory. However, the actions 
taken in the states that were sampled near the target were successful, so one 
would like to “encourage” these actions. Using the Markov property one can 
substantially improve the above algorithm by considering for each state the 
part of the reward from the visit to that state onwards. We therefore use the 
same trajectory and simultaneously calculate the performance for every state 
in the trajectory separately. The idea is that each choice of action in a given 
state affects the reward from that point on, disregarding the past. 

The sampling Algorithm 8.2.1 does not change in Steps 1 and 2. The difference
is in Step 3. Given a policy $\mathbf{X}$ and trajectory
$(Z_0, A_0, Z_1, A_1, \ldots, Z_\tau)$, we calculate the performance from
every state until termination. For every state $z = Z_j$ in the trajectory the
(estimated) performance is
$S_z(\mathbf{X}) = \sum_{k=j}^{\tau-1} r_k$. The updating formula for
$\hat{p}_{t,za}$ is similar to (8.7); however, each state z is updated
separately according to the (estimated) performance $S_z(\mathbf{X})$ obtained
from state z onwards:

$$\hat{p}_{t,za} = \frac{\sum_{k=1}^{N} I_{\{S_z(\mathbf{X}_k) \geq \hat{\gamma}_{t,z}\}} \, I_{\{\mathbf{X}_k \in \mathcal{X}_{za}\}}}{\sum_{k=1}^{N} I_{\{S_z(\mathbf{X}_k) \geq \hat{\gamma}_{t,z}\}} \, I_{\{\mathbf{X}_k \in \mathcal{X}_{z}\}}} . \tag{8.8}$$

A crucial point here is to understand that, in contrast to (8.7), the CE
optimization is carried out for every state separately and a different
threshold parameter $\hat{\gamma}_{t,z}$ is used for every state z at
iteration t. This facilitates faster convergence for “easy” states where the
optimal strategy is easy to find. The trajectory sampling method above can be
viewed as a variance reduction method. Numerical results indicate that the CE
algorithm with updating (8.8) is much faster than with updating (8.7).
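A possible Python sketch of the per-state update is given below. Several details are assumptions made for illustration and are not specified in the text: trajectories are lists of (state, action, reward) triples, a state visited more than once is scored from its first visit, the state-specific threshold is taken as the reward-to-go of the best $\varrho$-fraction of trajectories visiting that state, and the result is smoothed with a parameter `alpha`.

```python
from collections import defaultdict

def reward_to_go(trajectory):
    """Map each visited state z to (action, S_z), where S_z is the sum of
    rewards from the first visit to z onwards."""
    out, tail = {}, 0.0
    for z, a, r in reversed(trajectory):
        tail += r
        out[z] = (a, tail)          # earlier visits overwrite later ones
    return out

def update_policy(P, trajectories, rho, alpha):
    """Per-state elite update in the spirit of (8.8), followed by smoothing."""
    per_state = defaultdict(list)   # z -> list of (S_z, action)
    for traj in trajectories:
        for z, (a, s) in reward_to_go(traj).items():
            per_state[z].append((s, a))
    new_P = [row[:] for row in P]   # unvisited states keep their old rows
    for z, visits in per_state.items():
        visits.sort(reverse=True)                    # largest S_z first
        n_elite = max(1, int(round(rho * len(visits))))
        gamma_z = visits[n_elite - 1][0]             # threshold for state z
        elite = [a for s, a in visits if s >= gamma_z]
        for a in range(len(P[z])):
            freq = elite.count(a) / len(elite)
            new_P[z][a] = alpha * freq + (1 - alpha) * P[z][a]
    return new_P
```

Each state gets its own elite set, so a state whose best action is already clear can sharpen its row of P even when the trajectories that pass through it performed poorly elsewhere.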







Remark 8.2 (2-dependence). It is possible to further improve the efficiency of
the algorithm by remembering the last two steps of the path, rather than the
last step. Mathematically, this amounts to replacing the Markov property of
the trajectories with a similar “2-dependence” property.

Remark 8.3 (Different reward criteria). The sampling and updating procedures
for the discounted and the average reward criteria are somewhat more
involved, but not fundamentally different from those discussed above. For a
more detailed discussion see [112].

8.2.2 Numerical Results 

The CE method with the updating rule (8.8) and trajectory generation according
to Algorithm 8.2.1 was implemented for a maze problem, which represents a
two-dimensional grid. We assume the following:

1. Moves in the grid are allowed in four possible directions, with the goal
of moving from the upper-left corner to the lower-right corner.

2. The maze contains obstacles (“walls”) into which movement is not allowed.

3. The reward for every allowed movement until reaching the goal is -1.

In addition we introduce:

• A small (failure) probability of not succeeding in moving in an allowed
direction.

• A small probability of succeeding in moving in a forbidden direction
(“moving through the wall”).

• A high cost for moves in a forbidden direction.

In Figure 8.2 we present the results for a 20 x 20 maze. We set the following
parameters: N = 1000, $\varrho$ = 0.03, $\alpha$ = 0.7. The initial policy was
uniformly random. The costs of the moves were assumed to be random variables,
uniformly distributed between 0.5 and 1.5 for the allowed moves and uniformly
distributed between 25 and 75 for the forbidden moves. Note that the expected
costs for the allowed and forbidden moves are equal to 1 and 50, respectively.
The success probabilities in the allowed and forbidden states were taken to be
0.95 and 0.05, respectively. The arrows $z \to z'$ in Figure 8.2 indicate that
at the current iteration the probability of going from z to z' is at least
0.01. In other words, if a corresponds to the action that will lead to state
z' from z, then we will plot an arrow from z to z' provided $p_{za} > 0.01$.

In all our experiments CE found the target exactly, within 5-10 iterations.
The CPU time was less than one minute (on a 500MHz Pentium processor).
Note that the successive iterations of the policy in Figure 8.2 quickly
converge to the optimal policy. We have also run the algorithm for several
other mazes and the optimal policy was always found.








Fig. 8.2. Performance of the CE Algorithm for the 20 x 20 maze. Each arrow 
indicates a probability > 0.01 of going in that direction. 



8.3 Clustering and Vector Quantization 

The clustering problem reads as follows: given a dataset
$\mathcal{Z} = \{z_1, \ldots, z_n\}$ of points in some d-dimensional Euclidean
space, partition the data into K “clusters” $R_1, \ldots, R_K$ (with
$R_i \cap R_j = \emptyset$ for $i \neq j$, and $\cup_j R_j = \mathcal{Z}$),
such that some empirical loss function (performance measure) is minimized. A
typical loss function (see, for example, [172]) is

$$\sum_{j=1}^{K} \sum_{z \in R_j} \| z - c_j \|^2 , \tag{8.9}$$

where

$$c_j = \frac{1}{|R_j|} \sum_{z \in R_j} z$$

represents the cluster center or centroid of cluster $R_j$. Denoting by
$\mathbf{x} = (x_1, \ldots, x_n)$ the vector with $x_i = j$ when
$z_i \in R_j$, and letting $z_{ij} = I_{\{x_i = j\}} z_i$, we can write (8.9)
as

$$S(\mathbf{x}) = \sum_{j=1}^{K} \sum_{i=1}^{n} I_{\{x_i = j\}} \, \| z_{ij} - c_j \|^2 , \tag{8.10}$$

where the centroids can be written as

$$c_j = \frac{1}{n_j} \sum_{i=1}^{n} z_{ij} , \tag{8.11}$$

with $n_j = \sum_{i=1}^{n} I_{\{x_i = j\}}$ the number of points in the j-th
cluster.

As we mentioned, our goal is to partition the set of points $\mathcal{Z}$ into
K not necessarily equal sized clusters $R_j$, such that (8.9) is minimized. In
other words, we want to find a vector of centroids $(c_1, \ldots, c_K)$ and
the corresponding partition $\{R_j\}$ that minimize (8.9). This definition
also combines both the encoding and decoding steps in vector quantization
[172]. Namely, we wish to “quantize” or “encode” the vectors in $\mathcal{Z}$
in such a way that each vector is represented by one of K source vectors
$c_1, \ldots, c_K$, such that the loss (8.9) of this representation is
minimized. Most well-known clustering and vector quantization methods update
the vector of centroids, starting from some initial choice
$(c_{0,1}, \ldots, c_{0,K})$ and using iterative (typically gradient-based)
procedures. It is important to realize that in that case (8.9) is seen as a
function of the centroids, where each point z is assigned to the nearest
centroid, thus determining the clusters. It is well known that this type of
problem, optimization with respect to the centroids, is multi-extremal and
that, depending on the initial value of the clusters, the gradient-based
procedures converge to a local minimum rather than the global minimum. A
standard heuristic to minimize (8.10) is the K-means algorithm [172], which
consists of the following steps:

1. Initialize by assigning (randomly or deterministically) to each point in Z 
a cluster number in {1, ... , K}. 

2. Calculate the centroids ci, . . . , ck of the clusters. 

3. (Re) assign each point to the nearest centroid. 

4. Repeat Steps 2 and 3 until convergence is reached, for example if the 
clusters no longer change. 

A useful modification is the fuzzy K-means algorithm [28]. A detailed
description of various types of clustering methods may be found in [52] and
the accompanying [162].

Some useful URLs for clustering analysis:

http://www.pitt.edu/~csna/
http://www.astro.psu.edu/statcodes/sc_multclass.html
http://www.ph.tn.tudelft.nl/prtools/







We next present a CE approach to solve the clustering problem by viewing
it as a continuous multi-extremal optimization problem where, in analogy to
the K-means method, the centroids $c_1, \ldots, c_K$ are the decision
variables. In short, we consider the program

$$\min_{c_1, \ldots, c_K} S(c_1, \ldots, c_K) = \min_{c_1, \ldots, c_K} \sum_{j=1}^{K} \sum_{z \in R_j} \| z - c_j \|^2 , \tag{8.12}$$

where $R_j = \{ z : \| z - c_j \| < \| z - c_k \|, \; k \neq j \}$. That is,
$R_j$ is the set of data points that are closer to $c_j$ than to any other
centroid.

For better insight and easy reference we consider the program (8.12) with
K = 2 clusters and present the main steps of the CE method while using
normal pdfs for updating the centroids $c_j$, j = 1, 2, and assuming that each
$z_i \in \mathbb{R}^2$. We associate with the program (8.12) two 2-dimensional
normal distributions $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$, where
$\Sigma_1, \Sigma_2$ are the corresponding covariance matrices. As in a
typical CE application for continuous multi-extremal optimization we set the
initial matrices $\Sigma_1, \Sigma_2$ to be diagonal (with quite large
variances on the diagonals, say 100) and then we proceed as follows:

1. Choose deterministically or randomly the initial vectors $\mu_1$ and
$\mu_2$.

2. Generate K = 2 sequences of centroids (for cluster 1 and 2, respectively)

$$Y_{11}, \ldots, Y_{1N} \quad \text{and} \quad Y_{21}, \ldots, Y_{2N} ,$$

with $Y_{jk} \sim N(\mu_j, \Sigma_j)$, j = 1, 2, independently. For each
k = 1, ..., N calculate the objective function as in (8.9), with $c_j$
replaced by $Y_{jk}$, j = 1, 2.

3. Apply the CE Algorithm 4.2.1 (say with $\varrho$ = 0.01 and $\alpha$ = 0.7)
and update the parameters $(\mu_1, \mu_2)$ and $(\Sigma_1, \Sigma_2)$
accordingly.

4. Stop according to stopping rule (4.10) and accept the resulting parameter
vector $(\mu_{1T}, \mu_{2T})$ (at the final T-th iteration) as the estimate of
the true optimal solution $(c_1, c_2)$ of the program (8.12).

Using Remark 2.8 and (3.66), (3.67) we see that the means and variances for
each centroid are updated simply as the corresponding sample mean and sample
variance of the elite samples. Specifically, if
$X_1, \ldots, X_{N^{\mathrm{elite}}}$ are the elite samples corresponding to a
specific centroid (for cluster 1 or 2), then the related $\mu$ and $\Sigma$
are updated as

$$\hat{\mu} = \frac{1}{N^{\mathrm{elite}}} \sum_{i=1}^{N^{\mathrm{elite}}} X_i$$

and

$$\hat{\Sigma} = \frac{1}{N^{\mathrm{elite}}} \sum_{i=1}^{N^{\mathrm{elite}}} (X_i - \hat{\mu})(X_i - \hat{\mu})^{T} .$$

Note that we have not assumed independent components for each centroid
distribution. In the 2-dimensional case we therefore need to update
5 parameters for each centroid. For a d-dimensional normal distribution the
number of parameters is d + (d + 1)d/2. However, if we use independent
components for each $N(\mu, \Sigma)$ centroid distribution, then the number of
distributional parameters is 2d, because only the means and variances need to
be updated; the off-diagonal elements of $\Sigma$ are equal to 0. It follows
that for K clusters the total number of decision variables is 2dK when using
independent components. Henceforth we will only consider the case with
independent components.
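The procedure with independent components can be sketched in plain Python as follows. This is a hedged illustration rather than the implementation behind the numerical results: the function names, the default parameter values, the fixed number of iterations (instead of stopping rule (4.10)), the smoothing of standard deviations as well as means with the same `alpha`, and the initialization over the bounding box of the data are all assumptions.

```python
import random

def clustering_loss(points, centroids):
    """Loss (8.12): each point contributes its squared distance to the
    nearest centroid."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
               for p in points)

def ce_clustering(points, K, N=200, rho=0.1, alpha=0.7, n_iter=100,
                  sigma0=None, seed=0):
    rng = random.Random(seed)
    d = len(points[0])
    lo = [min(p[i] for p in points) for i in range(d)]
    hi = [max(p[i] for p in points) for i in range(d)]
    if sigma0 is None:
        sigma0 = max(h - l for h, l in zip(hi, lo))      # bounding-box size
    # Flat parameter vectors of length K*d (2dK parameters in total).
    mu = [rng.uniform(lo[i % d], hi[i % d]) for i in range(K * d)]
    sigma = [sigma0] * (K * d)
    n_elite = max(2, int(round(rho * N)))
    for _ in range(n_iter):
        samples = [[rng.gauss(m, s) for m, s in zip(mu, sigma)]
                   for _ in range(N)]
        scored = sorted(samples, key=lambda y: clustering_loss(
            points, [y[j * d:(j + 1) * d] for j in range(K)]))
        elite = scored[:n_elite]                         # smallest losses
        for i in range(K * d):
            col = [y[i] for y in elite]
            m_new = sum(col) / n_elite                   # elite sample mean
            s_new = (sum((v - m_new) ** 2 for v in col) / n_elite) ** 0.5
            mu[i] = alpha * m_new + (1 - alpha) * mu[i]
            sigma[i] = alpha * s_new + (1 - alpha) * sigma[i]
    centroids = [tuple(mu[j * d:(j + 1) * d]) for j in range(K)]
    return centroids, clustering_loss(points, centroids)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, loss = ce_clustering(points, K=2)
```

Note that, unlike K-means, the sampled centroids are scored directly by (8.12) and no assignment vector is ever stored; the partition is implied by the nearest-centroid rule.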

Remark 8.4 (Starting Positions). One advantage of the CE method is that,
as a global optimization method, it is very robust with respect to the initial
positions of the centroids. Provided that the standard deviation $\sigma$ is
chosen large enough, the initial mean $\mu$ has little or no effect on the
accuracy and convergence speed of the algorithm. In general we choose the
$\mu$ and $\sigma$ such that the initial sampling distribution is fairly
“uniform” over the smallest rectangle that contains the data points.
Practically, this means that the initial $\sigma$s should not be too small,
say equal to the width or height of this “bounding box.”

For the K-means method, however, a correct choice of starting positions
is essential. A well-known data-dependent initialization method is to
generate the starting positions independently, drawing each centroid from a
d-dimensional Gaussian distribution $N(\mu, \Sigma)$, where $\mu$ is the
sample mean of the data and $\Sigma$ the sample covariance matrix of the data.

8.3.1 Numerical Results 

In this section we present numerical experiments using the CE Algorithm 4.2.1
as well as three well-known clustering heuristics, namely K-means, fuzzy
K-means (FKM), and linear vector quantization (LVQ). The Matlab code for
these last three algorithms was taken from the Matlab classification toolbox
[162]; see also [52].

Two well-known types of 2-dimensional data sets were used from [162]:

(a) Banana data: Points are scattered around a segment of a circle.

(b) 3-Gaussian mixture data: Points are generated from a mixture of three
2-dimensional Gaussian distributions.

Generation of these data sets is straightforward. For convenience a banana
data generation algorithm is included in Appendix A.7.







Tables 8.4-8.6 present a comparative study of CE Algorithm 4.2.1 and
the traditional clustering algorithms for n = 200 and various numbers of
clusters K, on the models (a) and (b). In all experiments $\alpha$ = 0.7 and
$\varrho$ = 0.025. In all cases the sample size N = 800 is taken, so that the
number of elite samples is $N^{\mathrm{elite}} = \varrho N = 20$. All initial
standard deviations are 14 for the banana data and 6 for the 3-Gaussian data,
corresponding to the width/height of the bounding box for the data. The
initial means are chosen uniformly over this bounding box. The starting
positions for the other algorithms are chosen according to the standard
initialization procedure for the K-means algorithm discussed in Remark 8.4.
We stop the CE algorithm when the performance no longer changes in two decimal
places.

Each method was repeated 10 times for both data sets (a) and (b). We
use the following notations in the tables: T denotes the average total number
of iterations; $\bar{\gamma}_T$ denotes the averaged solution over 10 runs;
$\gamma^*$ is the best known solution; $\bar{\varepsilon}$ denotes the average
relative experimental error (based on 10 runs) with respect to the best known
solution $\gamma^*$. That is,

$$\bar{\varepsilon} = \frac{\bar{\gamma}_T - \gamma^*}{\gamma^*} . \tag{8.13}$$

Similarly, $\varepsilon^*$ and $\varepsilon_*$ denote the largest and smallest
relative experimental errors. Finally, CPU denotes the average CPU time in
seconds on a 1.6GHz PC.
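As a quick check of the relative error, writing the averaged solution as $\bar{\gamma}_T$ and the best known solution as $\gamma^*$, take the K-means entry for the banana data with K = 5 in Table 8.4, where the averaged solution is 294.31 and the best known solution is 288.11:

```latex
\bar{\varepsilon}
  = \frac{\bar{\gamma}_T - \gamma^*}{\gamma^*}
  = \frac{294.31 - 288.11}{288.11}
  = \frac{6.20}{288.11}
  \approx 0.0215 ,
```

which is reported as 0.02 in the table after rounding to two decimal places.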

In order to identify the global minimum accurately, we repeated our
experiments many times, using different $\varrho$, N, and smoothing parameter
$\alpha$ (see Remark 5.2). The smallest value found from these experiments is
given by $\gamma^*$ for each case. It was found that the smallest CE
performance of the 10 runs gives a reliable estimate for the true global
minimum.



Table 8.4. Performance of the four different methods for the data sets (a) and
(b), with n = 200, K = 5, N = 800, $N^{\mathrm{elite}}$ = 20, $\alpha$ = 0.7.

 Approach      T      $\bar{\gamma}_T$   $\gamma^*$   $\bar{\varepsilon}$   $\varepsilon_*$   $\varepsilon^*$    CPU
 (a) Banana data set
 CE           49.6       288.49          288.11        0.00                0.00              0.01             26.67
 K-means       9.3       294.31          288.11        0.02                0.01              0.04              0.09
 FKM          80.6       290.19          288.11        0.01                0.01              0.01              0.14
 LVQ          17.7       302.81          288.11        0.05                0.01              0.19              0.07
 (b) 3-Gaussian mixture
 CE           44.2        69.15           69.11        0.00                0.00              0.00             28.53
 K-means       7.8        81.68           69.11        0.18                0.02              0.96              0.11
 FKM          43.9        69.92           69.11        0.01                0.01              0.01              0.09
 LVQ           6.4        83.75           69.11        0.21                0.06              0.96              0.03







Table 8.5. Performance of the four different methods for the data sets (a) and
(b), with n = 200, K = 10, N = 800, $N^{\mathrm{elite}}$ = 20, $\alpha$ = 0.7.

 Approach      T      $\bar{\gamma}_T$   $\gamma^*$   $\bar{\varepsilon}$   $\varepsilon_*$   $\varepsilon^*$    CPU
 (a) Banana data set
 CE           75.8       197.22          195.87        0.01                0.00              0.02             64.25
 K-means       9         221.49          195.87        0.13                0.03              0.18              0.20
 FKM          86.1       199.36          195.87        0.02                0.01              0.03              0.20
 LVQ          14         210.85          195.87        0.08                0.01              0.20              0.08
 (b) 3-Gaussian mixture
 CE           82.4        49.15           48.16        0.02                0.00              0.04             94.93
 K-means      10.9        58.38           48.16        0.21                0.07              0.44              0.42
 FKM          63.5        49.04           48.16        0.02                0.00              0.05              0.15
 LVQ           8.4        54.54           48.16        0.13                0.05              0.24              0.05


Table 8.6. Performance of the four different methods for the data sets (a) and
(b), with n = 200, K = 20, N = 800, $N^{\mathrm{elite}}$ = 20, $\alpha$ = 0.7.

 Approach      T      $\bar{\gamma}_T$   $\gamma^*$   $\bar{\varepsilon}$   $\varepsilon_*$   $\varepsilon^*$    CPU
 (a) Banana data set
 CE          142.1       138.06          135.80        0.02                0.00              0.03            261.92
 K-means      10.1       169.03          135.80        0.24                0.12              0.31              1.20
 FKM         385.2       141.26          135.80        0.04                0.02              0.07              1.32
 LVQ          13.1       160.84          135.80        0.18                0.12              0.32              0.13
 (b) 3-Gaussian mixture
 CE          159.8        31.88           31.29        0.02                0.01              0.03            284.98
 K-means      10.9        45.32           31.29        0.45                0.26              0.66              2.15
 FKM         108.8        32.94           31.29        0.05                0.02              0.08              0.38
 LVQ           8.3        42.73           31.29        0.37                0.26              0.58              0.07


We see that the CE algorithm, although significantly slower, is more accurate
and consistent than the other algorithms. Among the fast algorithms
the FKM is by far the best. Observe also from Tables 8.4-8.6 that as K
increases, the efficiency (in terms of $\bar{\varepsilon}$, $\varepsilon_*$,
$\varepsilon^*$) of CE increases relative to its counterparts K-means, FKM and
LVQ. We found this in general to be the case. This can be explained by arguing
as follows:

1. The number of minima of the objective function in (8.12) increases with
K.

2. The CE method, which is a global optimization method, typically
avoids the local minima and as a result settles down in the global one.

3. The alternatives, K-means, FKM, and LVQ, which are local optimization
methods, are typically trapped in local minima.





It is clear that the “classical” K-means method with an average relative
experimental error of 10-100% compares poorly to the CE method, which is
slower but yields a vastly superior relative error of less than 1%. To formally
compare the CE approach with some other optimization method one could
use criterion (4.1).

Figures 8.3 and 8.4 illustrate for the banana and 3-Gaussian data,
respectively, the difference in the placement of the centroids for CE (circles)
and K-means (crosses). Note that for the 3-Gaussian data the K-means algorithm
has (wrongly) placed two centroids in the lower left-hand cluster.



[Figure: scatter plot of the banana data; horizontal axis from -6 to 8, vertical axis from -8 to 8.]

Fig. 8.3. The CE results for vector quantization of the 2-D banana data set. Circles
designate the final cluster centers (centroids) produced by CE. Crosses are cluster
centers of the K-means algorithm.




Finally, Table 8.7 presents the evolution of Algorithm 4.2.1 for the
banana problem with $n = 200$ and $K = 5$. Here $\sigma_t$ denotes the largest
standard deviation at iteration $t$. We found that the CE method for clustering
works well even if the data set is noisy. This is in contrast to most clustering
methods, which often require stochastic approximation procedures that are
very slow.
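
The kind of run summarized in Table 8.7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the program used for the table: the toy data, the parameter values and all variable names are chosen here, centroids are sampled from independent normal densities, and the loss is assumed to be the sum of squared distances to the nearest centroid.

```matlab
% Sketch: CE with normal updating for placing K centroids (toy data).
X = [0 0; 0 1; 1 0; 1 1; 10 10; 10 11; 11 10; 11 11];  % n = 8 points, d = 2
[n, d] = size(X);
K = 2; N = 200; Nel = 20; alpha = 0.7;   % illustrative CE parameters
mu    = ones(K, 1) * mean(X, 1);         % K x d means of the sampling density
sigma = 5 * ones(K, d);                  % K x d standard deviations
best_loss = inf;
for t = 1:50
    C = zeros(K, d, N); S = zeros(N, 1);
    for i = 1:N
        C(:, :, i) = mu + sigma .* randn(K, d);   % draw K candidate centroids
        D2 = zeros(n, K);
        for k = 1:K
            dv = X - ones(n, 1) * C(k, :, i);
            D2(:, k) = sum(dv.^2, 2);             % squared distances to centroid k
        end
        S(i) = sum(min(D2, [], 2));               % loss of this candidate
    end
    [Ss, idx] = sort(S);                 % minimization: smallest losses are elite
    E = C(:, :, idx(1:Nel));
    best_loss = min(best_loss, Ss(1));
    mu    = alpha * mean(E, 3)   + (1 - alpha) * mu;      % smoothed updating
    sigma = alpha * std(E, 0, 3) + (1 - alpha) * sigma;
    fprintf('%2d  %8.3f  %8.3f  %8.5f\n', t, Ss(Nel), Ss(1), max(sigma(:)));
    if max(sigma(:)) < 1e-3, break; end  % stop once the density has degenerated
end
```

Each printed row has the same layout as a row of Table 8.7: the iteration number, the worst elite loss, the best loss, and the largest standard deviation. For this toy data the smallest attainable loss is 4.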









Fig. 8.4. The CE results for vector quantization of the 2-D 3-Gaussian data set. 
Circles designate the final cluster centers (centroids) produced by CE. Crosses are 
cluster centers of the K-means algorithm. 



 $t$    $\hat\gamma_t$    $S_t^*$    $\sigma_t$

  1     452.18     377.37     3.00000
  2     434.09     395.01     3.20577
  3     420.91     366.87     3.38746
  4     403.82     356.78     3.16696
  5     374.37     336.55     3.30599
  6     364.62     333.48     2.94838
  7     344.82     325.18     2.53459
  8     333.42     313.58     2.22936
  9     317.22     302.15     1.58735
 10     305.09     295.90     1.15077
 11     296.10     292.34     0.65408
 12     291.75     290.31     0.45115
 13     289.66     288.55     0.26106
 14     288.80     288.50     0.18901
 15     288.46     288.27     0.11668
 16     288.26     288.18     0.07314
 17     288.18     288.14     0.05173
 18     288.14     288.13     0.03390
 19     288.12     288.12     0.02360
 20     288.11     288.11     0.01730
 21     288.11     288.11     0.01013
 22     288.11     288.11     0.00705

Table 8.7. Evolution of Algorithm 4.2.1 for the banana problem with $n = 200$,
$d = 2$, $K = 5$, $N = 800$, $N^{\mathrm{elite}} = 20$ and $\alpha = 0.7$.





8.4 Exercises 



Markov Decision Processes 

1. Consider the MDP problem of Section 8.2. Run Algorithm 4.2.1 for the
20 × 20 maze problem and obtain a figure similar to Figure 8.2.

Clustering 

2. Generate banana data using the Matlab code in Appendix A.7. Repeat
the experiments for the banana data in Tables 8.4-8.6 for various choices
of $n$ and $K$, for example $n = 2000$ and $K = 5$. As a rule of thumb
take $N = 20K$, $\rho = 0.025$ and $\alpha = 0.7$. Note that $N$ is 5 times the
number of parameters that have to be estimated. Does modified smoothing
(Remark 5.2) help to bring the solution closer to the global minimum?

3. An alternative CE approach to clustering is to view the problem as a
combinatorial minimal cut (min-cut) problem with $n$ nodes and $K$ partitions.
In particular, each partition $R_1, \ldots, R_K$ is represented via a partition
vector $x \in \mathcal{X} = \{1, \ldots, K\}^n$ as in (8.9). The “trajectory generation” of the
CE Algorithm 4.2.1 consists of drawing random $X \in \mathcal{X}$ according to an
$n$-dimensional discrete distribution with independent marginals, such that
$P(X_i = j) = p_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, K$. For $K = 2$, we may,
alternatively, use $X \sim \mathrm{Ber}(p)$, as in the ordinary ($K = 2$) max-cut problem.
With the performance $S(x)$ given in (8.11), the updating rules for the
$p_{ij}$ are of the form (4.14), noting that we have here a minimization
problem. This approach may be useful when the performance is a complicated
function of the data. For example, the data could represent a collection
of proteins, each of which has a list of characteristics. In this case the
performance function is not merely the sum of the Euclidean distances
but some complicated function of these characteristics.

Compare the min-cut approach with the continuous multi-extremal
approach for the banana data set and obtain figures similar to Figure 8.3.
Run also the FACE Algorithm 5.3.1. Compare the results of the FACE
Algorithm 5.3.1 with those obtained via the main CE Algorithm 4.2.1.
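
One CE iteration for the partition-vector formulation of Exercise 3 might look as follows. This is only a sketch: the sizes, the toy score function and all variable names are assumptions introduced here, and the score should be replaced by the clustering performance of the exercise.

```matlab
% Sketch: sample partition vectors with P(X_i = j) = p_ij, then update p_ij.
n = 6; K = 3; N = 500; Nel = 50;       % illustrative sizes
P = ones(n, K) / K;                    % P(i,j) = p_ij, uniform initially
U = rand(N, n);
cumP = cumsum(P, 2);                   % row-wise cumulative probabilities
X = ones(N, n);
for j = 1:K-1
    % inverse-transform sampling: X(k,i) = 1 + number of thresholds below U(k,i)
    X = X + (U > ones(N, 1) * cumP(:, j)');
end
target = [1 1 2 2 3 3];                % hypothetical labeling for the toy score
S = sum(X ~= ones(N, 1) * target, 2);  % toy performance, to be minimized
[Ss, idx] = sort(S);
E = X(idx(1:Nel), :);                  % elite samples (smallest scores)
Pnew = zeros(n, K);
for j = 1:K
    Pnew(:, j) = sum(E == j, 1)' / Nel;   % fraction of elites with X_i = j
end
```

Because this is a minimization problem, the elite set consists of the `Nel` samples with the smallest scores; in practice `Pnew` would be smoothed against `P` before the next iteration.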



Image Analysis 

4. A graph-theoretic approach to image analysis is to view a digital image
as a set of nodes $1, \ldots, n$, say, representing the pixels. With each node $i$
is associated a node “feature” $y_i$, indicating for example the gray level or
the RGB components of the pixel. The spatial position of node $i$ in the
image is denoted by $z_i$. The “similarity” between two nodes $i$ and $j$ is
specified via an edge $(i, j)$, with weight or cost $c_{ij}$. For example, the cost

$$ c_{ij} = \begin{cases} e^{-\|y_i - y_j\|^2/\sigma_y^2} \times e^{-\|z_i - z_j\|^2/\sigma_z^2} & \text{if } \|z_i - z_j\| < r, \\ 0 & \text{otherwise,} \end{cases} \qquad (8.14) $$

where $\sigma_y$, $\sigma_z$ and $r$ are constants, captures both the feature similarity and
the spatial proximity of nodes $i$ and $j$.

In image segmentation the objective is to partition the nodes into two
or more “similar” groups. The goodness of the image partition can be
measured via many different performance functions [51]. For the
bipartition case Shi and Malik [157] suggested the following normalized cut
performance function:

$$ S(x) = \frac{\sum_{i \in V_1(x),\, j \in V_2(x)} c_{ij}}{\sum_{i \in V_1(x)} d_i} + \frac{\sum_{i \in V_1(x),\, j \in V_2(x)} c_{ij}}{\sum_{j \in V_2(x)} d_j} , \qquad (8.15) $$

where $d_i = \sum_j c_{ij}$ denotes the total cost from $i$ to all other nodes, and
$\{V_1(x), V_2(x)\}$ is the partition corresponding to the binary cut vector
$x = (x_1, \ldots, x_n)$, as in the max-cut problem.

a) Segment a sample image into two segments (black and white), by
minimizing the normalized cut (8.15) via the CE method.

b) Compare the outcomes for a variety of CE parameters and model 
parameters in (8.14). 
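
As a starting point for this exercise, the cost matrix (8.14) and the normalized cut (8.15) can be evaluated as in the sketch below. The six-pixel "image", the constants and all names are assumptions made for illustration, and the diagonal costs $c_{ii}$ are set to zero, which the text does not specify.

```matlab
% Sketch: costs (8.14) and normalized cut (8.15) for a tiny hypothetical image.
y  = [0.10 0.20 0.15 0.90 0.80 0.85]';       % gray levels (features)
z  = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];         % pixel positions
sy = 0.2; sz = 2; r = 10;                    % assumed constants in (8.14)
n  = length(y);
C  = zeros(n);
for i = 1:n
    for j = 1:n
        dz = norm(z(i, :) - z(j, :));
        if i ~= j && dz < r                  % c_ii = 0 is an assumption
            C(i, j) = exp(-(y(i) - y(j))^2 / sy^2) * exp(-dz^2 / sz^2);
        end
    end
end
dd = sum(C, 2);                              % d_i: total cost from node i
cutval = @(x) sum(sum(C(x == 1, x == 0)));   % cost between V1 and V2
ncut = @(x) cutval(x) * (1 / sum(dd(x == 1)) + 1 / sum(dd(x == 0)));
x_good = [1 1 1 0 0 0];                      % separates dark from bright pixels
x_bad  = [1 0 1 0 1 0];                      % mixes the two groups
fprintf('ncut(good) = %g, ncut(bad) = %g\n', ncut(x_good), ncut(x_bad));
```

The natural split gives a much smaller normalized cut than the mixed one, so a CE sampler over binary vectors $x$, as in the max-cut problem, would treat small values of `ncut` as good performance.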



5. Bayesian inference is a rich source of complicated optimization problems,
which may be tackled via the CE method. Consider the following example
in image analysis: An observed digital image consists of $n$ “gray levels”
$y = (y_1, \ldots, y_n)$. Assume that the original image consists of only
two gray levels, $\mu_0$ and $\mu_1$. However, the observed image $y$ is contaminated
by random and independent Gaussian noise with zero mean and variance
$\sigma^2$ (see Figure 8.5). In order to recover the original image we must infer
from $y$ the original gray level ($\mu_0$ or $\mu_1$) for each pixel $i$.

Solving this problem is equivalent to two-class labeling of the observed
image $y$. For each pixel $i$ we assign a label $x_i \in \{0, 1\}$ that designates one
of the two gray levels $k = 0, 1$ of the original uncorrupted image.

A Bayesian approach to this problem proceeds in three steps. First, the a
priori information about the unknown $x$ is summarized by some density
$f(x)$. Second, given an original image $x$, the likelihood of obtaining data $y$
is described by the conditional density $f(y \mid x)$. Third, by Bayes's formula,
the posterior information about $x$ given the data $y$ is given by

$$ f(x \mid y) = c\, f(y \mid x)\, f(x) , $$

where $c$ is a normalization constant. The objective is to maximize this
posterior probability. Defining the energies $U(x) = -\ln f(x)$, $U(x \mid y) =
-\ln f(x \mid y)$ and $U(y \mid x) = -\ln f(y \mid x)$, we see that maximizing the
posterior probability with respect to $x$ is equivalent to minimizing the
posterior energy








Fig. 8.5. A gray scale image corrupted by noise. 

$$ S(x) = U(x \mid y) = U(y \mid x) + U(x) . \qquad (8.16) $$

It thus remains to specify the likelihood $f(y \mid x)$ and the a priori density
$f(x)$, or equivalently the energies $U(y \mid x)$ and $U(x)$. An often used model
for the likelihood of the observations is to assume that, given $x_i = k$, the
corrupted pixel value $Y_i$ has a Gaussian distribution with mean $\mu_k$ and
variance $\sigma^2$. Thus,

$$ f_i(y_i \mid x_i = k) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \mu_k)^2}{2\sigma^2} \right) . \qquad (8.17) $$

Assuming independent noise, the joint pdf for the entire image $Y$ is given
by $f(y \mid x) = \prod_i f_i(y_i \mid x_i)$, which implies

$$ U(y \mid x) = -\ln f(y \mid x) = \sum_{k=0}^{1} \sum_{i=1}^{n} \frac{(y_i - \mu_k)^2}{2\sigma^2}\, \delta_{ik} + \text{const} , \qquad (8.18) $$

where $\delta_{ik}$ equals 1 when the pixel $i$ is assigned to class $k$, and 0 otherwise.
A well-known choice for the a priori energy is

$$ U(x) = \sum_{i=1}^{n} \sum_{j \in \mathcal{N}_i} \beta_j (x_i - x_j)^2 , \qquad (8.19) $$

where $\mathcal{N}_i$ is the set of neighborhood pixels of $i$, and the $\beta_j$ are constants,
independent of $i$.

a) Select or construct a black and white test image.

b) Corrupt the image with independent Gaussian noise.

c) Reconstruct the image by minimizing (8.16) using the CE method.
Consider two cases: (a) $\mu_0$ and $\mu_1$ are known in advance, and (b) $\mu_0$ and
$\mu_1$ are unknown and have to be estimated.
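
The posterior energy (8.16) is cheap to evaluate, which is what makes a CE search over label images feasible. The sketch below is a hedged illustration: the 3 × 3 test image, the noise pattern, a single constant $\beta$ for all neighbors, a 4-pixel neighborhood and all names are assumptions introduced here, and the additive constant in (8.18) is dropped.

```matlab
% Sketch: posterior energy S(x) = U(y|x) + U(x) for a tiny binary image.
mu = [0 1]; sig = 0.5; beta = 0.5;            % assumed gray levels and constants
xtrue = [0 0 0; 0 1 1; 0 1 1];                % hypothetical original labels
noise = 0.1 * [1 -1 0; 0 1 -1; -1 0 1];       % fixed "noise" for reproducibility
y = mu(xtrue + 1) + noise;                    % observed image
% Likelihood energy (8.18), without the constant term.
lik = @(x) sum(sum((y - mu(x + 1)).^2)) / (2 * sig^2);
% Prior energy (8.19) with 4-neighborhoods; each neighboring pair is counted
% twice, once from each side, hence the factor 2.
prior = @(x) 2 * beta * (sum(sum(diff(x, 1, 1).^2)) + sum(sum(diff(x, 1, 2).^2)));
S = @(x) lik(x) + prior(x);
fprintf('S(true) = %g, S(all zeros) = %g, S(inverted) = %g\n', ...
        S(xtrue), S(zeros(3)), S(1 - xtrue));
```

In a CE implementation each pixel label could be drawn as an independent Bernoulli variable, exactly as the cut vectors are drawn in the max-cut program of Appendix A.2, with `S` as the performance to be minimized.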




A 



Example Programs 



In this appendix we give various Matlab implementations of CE algorithms. 
Devising and implementing CE algorithms for COPs is sometimes more of an 
art than a science. We would like to suggest a few strategies and “rules of 
practice” when making CE programs: 

1. Clearly specify the trajectory generation and updating rules for each prob- 
lem. If several alternative trajectory generation algorithms are found it is 
suggested to describe at least two of them. Present also the pseudocodes 
and the programs (in Matlab or C). 

2. Repeat all runs at least 5-10 times (starting each run from a different 
stream of random numbers) and present summary data similar to the 
tables in this book. 

3. Run, in addition, simulated annealing and/or genetic algorithms and
compare the results with the CE ones.

4. Run the CE algorithm first on a simple artificial (synthetic) problem,
where the solution is known in advance, and only then run more
complicated cases, such as case studies on the Web.

5. For each problem, associate a relevant artificial problem of size $n = 4$-$5$
and, using Algorithm 4.2.2 (the deterministic version of Algorithm 4.2.1),
calculate $\{(\gamma_t, v_t)\}$ analytically step by step, as in Example 4.7.

6. To assess whether more complicated reward functions (see Section 5.2) are
preferred, run CE Algorithm 4.2.1 using the standard indicator $I_{\{S(X) \ge \gamma_t\}}$
and with the alternative reward $I_{\{S(X) \ge \gamma_t\}}\, S(X)$. Compare the results.

7. Discuss briefly your results and make conclusions. 







Parameter setting for COPs 

For the main CE Algorithm 4.2.1 we suggest selecting

$$ \rho = \begin{cases} \dfrac{\ln n}{n} & \text{if } n < 100 , \\ 0.01 & \text{if } n \ge 100 , \end{cases} $$

and taking $\alpha \in (0.3, 0.8)$. Depending on whether the problem is of SNN- or
SEN-type, take the sample size $N = Cn$ or $N = Cn^2$ ($5 \le C \le 10$), respectively.

To define $C$ and $\alpha$ more accurately we suggest (for quite large networks)
running an associated problem of smaller size. For example, in a TSP problem
with $n = 200$ cities, one could

1. associate with the original TSP problem an auxiliary one with, say, 20 out
of 200 randomly chosen cities, keeping the distance matrix between these
20 cities the same as in the original problem;

2. run the auxiliary problem for several combinations of $C$ and $\alpha$;

3. adopt the best $C$ and $\alpha$ (corresponding to the most accurate version) for
the original problem.

For the noisy version of Algorithm 4.2.1 (see Chapter 6) the sample size
needs to be increased, e.g., from $N = 5n$ and $N = 5n^2$, for SNN and SEN, to
$N = 25n$ and $N = 25n^2$, respectively. We again keep $\rho = 0.01$ and $\alpha \in (0.3, 0.8)$.
Finally, for the FACE modification of CE, Chapter 5, follow Algorithm 5.3.1.
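
The rules of thumb above can be collected in two small helpers. This is a convenience sketch only; the names `ce_rho` and `ce_N` and the string labels 'SNN' and 'SEN' are introduced here for illustration and do not appear in the programs of this appendix.

```matlab
% Sketch: suggested rarity parameter and sample size for COPs.
% ce_rho(n)        : ln(n)/n for n < 100, and 0.01 otherwise
% ce_N(n, C, type) : C*n for 'SNN' problems, C*n^2 for 'SEN' problems
ce_rho = @(n) (n < 100) .* (log(n) ./ n) + (n >= 100) .* 0.01;
ce_N   = @(n, C, type) C * n^(1 + strcmp(type, 'SEN'));

n = 20; C = 5;
fprintf('n = %d: rho = %.4f, N(SNN) = %d, N(SEN) = %d\n', ...
        n, ce_rho(n), ce_N(n, C, 'SNN'), ce_N(n, C, 'SEN'));
% For the noisy version the text suggests a fivefold larger sample:
fprintf('noisy: N(SNN) = %d, N(SEN) = %d\n', ...
        ce_N(n, 5*C, 'SNN'), ce_N(n, 5*C, 'SEN'));
```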



A.1 Rare Event Simulation

The Matlab function below can be used to reproduce the results of the first
toy example of the tutorial; see Section 2.2.1.

function toy1(u)
% this function takes a vector of means, u, and outputs
% an estimate of the probability that the min path length
% exceeds gamma, and gives a table showing the parameters
% at each step

tic
v=u;        % initialize v
gamma=2;
N = 10^3;   % number of samples for a normal step
N1= 10^5;   % number of samples for the final step
n=length(u);
rho=0.1;    % rarity parameter, required later on
g = 0;

while (g < gamma)
    y=-log(rand(N,n));      % unit-mean exponentials ...
    for k=1:N,
        y(k,:)=v.*y(k,:);   % ... scaled to have means v
    end
    % calculate S
    S = S_len(y);
    [SS,SSidx]=sort(S);
    eidx=round((1-rho)*N);
    g=SS(eidx);             % the (1-rho) quantile of S
    if g>=gamma,
        g=gamma;
        % successively decrease eidx until we go below gamma
        while eidx>=1 && SS(eidx)>=g,
            eidx=eidx-1;
        end
        eidx=eidx+1;        % add one, and we are at the lowest eidx
    end
    W=ones(N,1);            % likelihood ratios
    for j=1:n,
        W=W.*exp(-y(:,j)*(1/u(j) - 1/v(j)))*(v(j)/u(j));
    end
    for j=1:n,              % update v
        v(j)=sum(W(SSidx(eidx:N)).*(y(SSidx(eidx:N),j)))/sum(W(SSidx(eidx:N)));
    end
    disp([g,v])
end

% final step: estimate the probability with the last v
y=-log(rand(N1,n));
for k=1:N1,
    y(k,:)=v.*y(k,:);
end
S = S_len(y);
W=ones(N1,1);
for j=1:n,
    W=W.*exp(-y(:,j)*(1/u(j) - 1/v(j)))*(v(j)/u(j));
end
l=sum((S>=g).*W)/N1;
std = sqrt((sum((S>=g).*W(:).^2)/N1 - l*l)/N1);
RE = std/l;
disp(['RE: ',num2str(RE)])
disp(['l: ',num2str(l)])
fprintf('95 perc. CI : (%g,%g)\n', l - l*RE*1.96, l+l*RE*1.96)
toc
return;

function s=S_len(X)
% Current S(X): length of the shortest path
s1 = min(X(:,1)+X(:,4),X(:,1)+X(:,3)+X(:,5));
s2 = min(X(:,2)+X(:,5),X(:,2)+X(:,3)+X(:,4));
s = min(s1,s2);
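
A quick way to check the likelihood-ratio lines in the program above is to verify that the weights average to 1 under the sampling density. The sketch below does this in isolation; the vectors `u` and `v` are illustrative values chosen here, not output of the program.

```matlab
% Sketch: sanity check of the exponential likelihood ratio W.
u = [0.25 0.4 0.1 0.3 0.2];                 % nominal means (illustrative)
v = 2 * u;                                  % hypothetical tilted means
N1 = 1e5; n = length(u);
y = -log(rand(N1, n)) .* (ones(N1, 1) * v); % column j is exponential, mean v(j)
W = ones(N1, 1);
for j = 1:n
    % ratio of the Exp(mean u(j)) density to the Exp(mean v(j)) density
    W = W .* exp(-y(:, j) * (1/u(j) - 1/v(j))) * (v(j) / u(j));
end
fprintf('mean of W: %g (should be close to 1)\n', mean(W));
```

If the reported mean is far from 1, the sampling step and the weight formula are inconsistent with each other, which is a common source of error when adapting the program to other distributions.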



A.2 The Max-Cut Problem

The following Matlab program has been used to produce the results of the synthetic
max-cut problem in Table 2.4.1 and Figures 2.3 and 2.4.

% A Matlab demonstration of the CE method for
% solving the synthetic MAXCUT problem

clear

% Setting parameters
m = 200;       % (2m nodes total)
N = 1000;      % sample size
rho = 0.1;
maxIters=23;   % Number of iterations

% set random seed
rand('state',1234);

% first construct a random cost matrix
Z11 = rand(m);
for i = 1:m,
    for j =i:m,
        Z11(j,i) = Z11(i,j);
    end;
    Z11(i,i) = 0;
end
Z22 = rand(m);
for i = 1:m,
    for j =i:m,
        Z22(j,i) = Z22(i,j);
    end;
    Z22(i,i) = 0;
end
B12 = ones(m);
B21 = ones(m);
C = [Z11 B12; B21 Z22];   % cost matrix

% Calculating optimal score
optp = [ones(1,m),zeros(1,m)]
y = rand(1,2*m);
x = (y<optp);             % generate cut vector, according to p
s = scut(C,x);
fprintf('Best score %f\n',s)

% Initializing
p = 1/2 * ones(1,2*m);
p(1) = 1;
gammas = zeros(1,maxIters);
curbests = zeros(1,maxIters);
ps = zeros(maxIters,2*m);
pdist = zeros(1,maxIters);

tic
curbest=0.0;
for j = (1:maxIters)      % main CE loop
    j                     % output iteration number
    % generate matrix X of cutvectors
    Y = rand(N,2*m);
    X = (Y < ones(N,1)*p);
    g = zeros(N,1);
    for i=1:N
        g(i,:) = scut(C,X(i,:));
    end
    [sortcut,sortidx] = sort(g);
    % sortidx(1) contains index of the smallest cutvector

    % update gamma
    eidx = ceil((1-rho)*N);   % smallest index of best
    gamma = sortcut(eidx);
    curbest = max(curbest,sortcut(end))   % keep track of best result
    gammas(j)=gamma;
    curbests(j) = curbest;
    pdist(j) = min(norm(p - optp),norm(p - ~optp));

    % update p on basis of elite vectors, that is, the vectors
    % X(sortidx(eidx),:) to X(sortidx(N),:)
    for i=1:2*m
        p(i)= sum(X(sortidx(eidx:N),i))/(N-eidx+1);
    end
    ps(j,:) = p;
end
toc   % end timing

% gammas is the calculated gamma
% pdist is the distance from the optimal p
finalp = p
fprintf('Calculated best score %f\n',curbest);
figure(1);
plot(1:maxIters,gammas);
xlabel('t');ylabel('\gamma_t');
figure(4);
plot(1:maxIters,pdist);
xlabel('Number of iteration');ylabel('||p-p_o_p_t||_2');
figure(2);
subplot(10,1,1); bar(1/2*ones(1,2*m));
for j=1:9
    subplot(10,1,j+1); bar(ps(j,:));
end
figure(3);
for j=1:10
    subplot(10,1,j); bar(ps(j+9,:));
end
for i = 1:maxIters
    fprintf('%d %f %f %f\n',i,gammas(i),curbests(i),pdist(i));
end

function [perf] = scut(C,x)
V1 = find(x);    % {V1,V2} is the partition
V2 = find(~x);
perf = sum(sum(C(V1,V2)));   % the size of the cut
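
The loop over `scut` dominates the running time of the program above. As a possible speed-up, the scores of all cut vectors can be computed with one matrix product. The sketch below, on a small random instance with names chosen here for illustration, checks that the vectorized expression agrees with the loop.

```matlab
% Sketch: vectorized cut scores versus the row-by-row loop.
m = 4; Ns = 10;                              % small illustrative instance
C = rand(2*m); C = triu(C, 1); C = C + C';   % symmetric costs, zero diagonal
X = rand(Ns, 2*m) < 0.5;                     % Ns random cut vectors
g_loop = zeros(Ns, 1);
for i = 1:Ns
    V1 = find(X(i, :)); V2 = find(~X(i, :));
    g_loop(i) = sum(sum(C(V1, V2)));         % same computation as scut
end
Xd = double(X);
% (Xd*C)(k,j) sums the costs from V1 of sample k into node j;
% multiplying by (1 - Xd) keeps only the nodes j that lie in V2.
g_vec = sum((Xd * C) .* (1 - Xd), 2);
fprintf('max difference: %g\n', max(abs(g_loop - g_vec)));
```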



A.3 Continuous Optimization via the Normal
Distribution

The program below illustrates how to optimize a general continuous multi-extremal
function via the CE method, using “normal” updating rules. Typical use:
cecont(-6,10,0.1,0.7,100,0.05,'S'), with S.m a certain function, for example as
below; see also Example 5.1.

% 1. mu    - initial mean
% 2. sigma - initial standard deviation
% 3. rho   - rarity parameter
% 4. alpha - smoothing parameter
% 5. N     - sample size
% 6. eps   - tolerance
% 7. FUN   - performance function (name of an m-file, e.g. 'S')
%
function cecont(mu,sigma,rho,alpha,N,eps,FUN)
tic
t=0;   % iteration counter
while sigma > eps
    t=t+1;
    x= mu + sigma*randn(N,1);
    SX=feval(FUN,x);              % Compute the performance.
    sortSX=sortrows([x SX],2);    % sort samples by increasing performance
    eidx=round((1-rho)*N);        % first elite index (rounded to an integer)
    mu1=mean(sortSX(eidx:N,1));
    mu=alpha.*mu1+(1-alpha).*mu;  % smoothed updating of mu
    sigma1 = std(sortSX(eidx:N,1));
    sigma=alpha*sigma1+(1-alpha)*sigma;   % smoothed updating of sigma
    fprintf('%g %3.4f %3.4f %3.4f\n',t,feval(FUN,mu),mu,sigma)
end
toc

function res=S(x);
res =exp(-(x-2).^2) + 0.8*exp(-(x+2).^2);



A.4 FACE 

The Matlab program q.m below is an example of a FACE implementation for the 
n-queen problem of Exercise 6 of Chapter 2. To run, simply type q in the Matlab 
shell, or call the program with your own selection of parameters. 

function B=q(Ne,Nmin,Nmax,alpha,d,c,n)
% A program to solve the Queen Problem
% via the Fully Adaptive Cross-Entropy method
%
% Syntax: B=q(Ne,Nmin,Nmax,alpha,d,c,n)
%
% Inputs -
%   Ne:    Number of elite samples to use
%   Nmin:  Minimum number of samples to use (must be >= Ne)
%   Nmax:  Maximum number of samples to use (must be >= Ne)
%   alpha: Smoothing Parameter
%
% Extra Inputs -
%   d: number of consec. S_t^*'s with no gamma_hat_t improvement
%   c: number of N_t's=Nmax in a row
%   n: n queens on an nxn board
%
% Output -
%   B: An nxn matrix with queens denoted as 1's, and blank
%      squares denoted as 0's

% some default parameters if they are not supplied
if nargin<7,n=8;end
if nargin<6,c=4;end
if nargin<5,d=6;end
if nargin<4,alpha=0.7;end
if nargin==0,Ne=50;end
if nargin<2,Nmin=10*Ne;Nmax=50*Ne;end
if (Nmin<Ne)||(Nmax<Ne)||(Nmax<Nmin),
    fprintf('Please give a valid (Ne,Nmin,Nmax) tuple.\n')
    B=0;return;
end

N=Nmin;   % start sampling with the minimum allowed
st=rand(d,1)+n*n;
nt=rand(c,1);
gt=100;
v=(1/n)*ones(n);   % initial probabilities
iii=1;
ind=1;
count=1;
ccc=0;

while iii~=0,
    if (N+ccc)>=Nmax,
        if (sum(nt>=Nmax)==c),   % we have reached the limit here
            fprintf('unreliable solution!\n')
            break;
        end
    end
    s=[];   % set up score vector
    for kk=1:N,   % calculate and score N_t queen layouts
        B=genq(v,n);
        S=scoreq(B,n);
        s=[s;S];
        V(:,:,kk)=B;
    end
    [s,I]=sort(s);   % Sort the scores
    gtold=gt;
    gt=s(Ne);
    V=V(:,:,I(1:Ne));   % the elite samples
    if s(1)==0,   % if we have a valid solution
        fprintf('t = %d , N_t = %d , g_hat_t = %g , S_star_t = %d\n', ...
            count,N,s(Ne),s(1))
        fprintf('valid solution!\n')
        B=V(:,:,1);   % return the valid layout, not the last one generated
        break;
    end
    if s(Ne)<gtold,
        st=rand(d,1)+n*n;   % reset counting for s_star_t
    end
    if (gt>=gtold)&&(s(1)>=st(d)),
        if (sum(st==s(1))==d),
            fprintf('reliable solution.\n')
            break;
        else
            ccc=0;
            s=s(1:Ne);
            fprintf('N_t: %d ',Nmin)
            while ind~=0,
                if Nmin==Nmax,break;end
                B=genq(v,n);
                S=scoreq(B,n);
                ccc=ccc+1;
                if S<s(Ne),
                    s=[s;S];
                    V(:,:,Ne+1)=B;
                    [s,I]=sort(s);
                    s=s(1:Ne);
                    V=V(:,:,I(1:Ne));
                    if (s(1)==0),
                        fprintf('\nt = %d , N_t = %d , g_hat_t = %g , S_star_t = %d\n', ...
                            count,N+ccc,s(Ne),s(1))
                        fprintf('valid solution!\n')
                        B=V(:,:,1);
                        return;
                    end
                end
                if mod(ccc,5*Nmin)==0,
                    fprintf('\nN_t: ')
                end
                if mod(ccc,Nmin)==0,
                    fprintf(' %d ',N+ccc)
                elseif mod(ccc,round(Nmin/5))==0,
                    fprintf('.')
                end
                if s(Ne)<gtold,
                    st=rand(d,1)+n*n;   % reset counting for s_star_t
                    break;
                end
                if ((N+ccc)>=Nmax)||(s(1)<st(d)),
                    break;
                end
            end
            gt=s(Ne);
            w=(1/Ne)*sum(V,3);
            v=alpha*w+(1-alpha)*v;
            fprintf('\nt = %d , N_t = %d , g_hat_t = %g , S_star_t = %d\n', ...
                count,N+ccc,s(Ne),s(1))
            N=Nmin;
            count = count+1;
            st(1:(d-1))=st(2:d);
            st(d)=s(1);
            nt(1:(c-1))=nt(2:c);
            nt(c)=N+ccc;
        end
    else
        if gt<gtold,
            nt=rand(c,1);
        end
        w=(1/Ne)*sum(V,3);
        v=alpha*w+(1-alpha)*v;
        fprintf('t = %d , N_t = %d , g_hat_t = %g , S_star_t = %d\n', ...
            count,N,s(Ne),s(1))
        N=Nmin;
        count = count+1;
        st(1:(d-1))=st(2:d);
        st(d)=s(1);
        nt(1:(c-1))=nt(2:c);
        nt(c)=N;
    end
end

function B=genq(v,n)
% generate an outcome based on probabilities v
% where v is an nxn matrix of probs.
% note: we require one queen per row
%       each row of v MUST sum to > 0
Z=zeros(n);B=Z;
r=rand(n,1);
if mod(n,2)==1,   % odd
    h=(n-1)/2;
else
    h=n/2;
end
for i=1:n,   % see where to add the next queen
    for j=1:n,
        Z(i,j+1)=Z(i,j)+v(i,j);   % cumulative probabilities of row i
        if ((Z(i,j)<=r(i)) && (r(i)<Z(i,j+1))),
            if ((i==1)&&(j>(ceil(n/2)))),
                B(i,j-h)=1;   % first row: fold onto the left half (symmetry)
            else
                B(i,j)=1;
            end
        end
    end
end

function s=scoreq(B,n)
% Compute the score for an nxn chessboard,
% ie: Score is the number of "hits" between queens
s=0;   % initial score
% score cols and not rows, because of the way we gen outcomes
a=sum(B,1);
s=s+sum((a(a>1)-1));
% score diagonal hits
a=sum(spdiags(B),1);
s=s+sum((a(a>1)-1));
a=sum(spdiags(B(n:-1:1,:)),1);
s=s+sum((a(a>1)-1));
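
To test a scoring routine such as `scoreq` it helps to have an independent computation on layouts with a known score. The sketch below counts column and diagonal "hits" directly from the queen positions; the helper names `hits` and `score_of` and the compact row-to-column representation are assumptions introduced here, not part of q.m.

```matlab
% Sketch: number of "hits" for a layout given as a row vector of columns,
% with one queen per row (queen of row r sits in column cols(r)).
% hits(v): for each distinct value in v, count the occurrences beyond the first.
hits = @(v) sum(max(accumarray(v(:), 1) - 1, 0));
score_of = @(cols) hits(cols) ...                                 % shared columns
    + hits((1:numel(cols)) - cols + numel(cols)) ...              % one diagonal direction
    + hits((1:numel(cols)) + cols - 1);                           % the other direction

valid = [1 5 8 6 3 7 2 4];   % a known solution of the 8-queens problem
diagl = 1:8;                 % all queens on the main diagonal
fprintf('score(valid) = %d, score(diagonal) = %d\n', ...
        score_of(valid), score_of(diagl));
```

A valid layout must score 0, which is the stopping condition `s(1)==0` in the FACE program, while placing all eight queens on the main diagonal gives 7 hits.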



A.5 Rosenbrock

The following Matlab script was used to find the global optimum of the Rosenbrock
function in Section 5.4. The program uses a fixed number of (elite) samples to
update the parameters of the normal distributions. It also implements the modified
smoothing scheme in (5.1). The algorithm simply stops after a total of t_max
iterations. To apply the program to the trigonometric function simply uncomment the
relevant line below.

clear all
format short e
% t       : iteration number
% t_max   : total number of iterations
% N       : sample size
% Gamma   : vector of gammas
% Alf     : smoothing parameter
% N_El    : number of elite samples
% n       : dimension
% Betta   : parameter used in modified smoothing
% q       : parameter used in modified smoothing
% D       : vector of variances
% M       : vector of means
% X       : Nxn matrix, containing a sample of N n-dimensional vectors
% SA      : Performance of X
% S_Sort  : vector of sorted performances
% S_Best  : best performance
% V_Sort  : vector of indices of sorted performances
% X_Gamma : the worst of the elite samples
% X_Best  : the best of the elite samples
N = 1000, n = 10, N_El = 20, Alf = 0.8 , Betta = 0.9, q = 6
t0 = 1, t_max = 2000;
M = -2 + 4*rand(1,n);   % random initial Mean
M_Last = M;
D = 10000.0*ones(1,n);
D_Last = D;
X_Best_Last = zeros(1,n);
X_Gamma_Last = zeros(1,n);
tic
for t = t0:t_max,
    M = Alf*M + (1-Alf)*M_Last;
    B_Mod = Betta - Betta*(1-1/t)^q;
    D = B_Mod*D + (1-B_Mod)*D_Last;
    R = randn(N,n);
    Std = D.^0.5;
    for v = 1:N,
        X(v,:) = M + Std.*R(v,:);
    end;
    SA = RosPen(N,n,X);    % call the Rosenbrock function (with penalty)
    %SA = trigo(N,n,X);    % call the trigonometric function
    [S_Sort,V_Sort] = sort(SA);
    X_Gamma = X(V_Sort(N_El),:);
    X_Best = X(V_Sort(1),:);
    M_Current(t,:) = M;
    D_Current(t,:) = D;
    Gamma(t) = S_Sort(N_El);
    S_Best(t) = S_Sort(1);
    S_Min(t) = min(S_Best);
    Glob_Min = S_Min(t);
    t_Last = t;
    M_Last = M;
    D_Last = D;
    X_Best_Last = X_Best;
    X_Gamma_Last = X_Gamma;
    X_El = X(V_Sort(1:N_El),:);
    M = mean(X_El);
    D = (std(X_El).^2);
    if mod(t,100)==0
        fprintf('%d %5.3f %5.3f',t,Gamma(t),S_Best(t));
        fprintf(' %6.2f',M)
        fprintf('\n')
    end;
end
toc
M
S_Best(t)

function S = RosPen(N,n,X)   % Rosenbrock function with penalty
X_High = 2*ones(N,n); X_Low = -2*ones(N,n);
X_Pen = max(X,X_High) - X_High + X_Low - min(X_Low,X);
S = 1000000000.0*X_Pen(:,1);
for i=1:(n-1),
    S_i = 100*(X(:,(i+1))-X(:,i).^2).^2 + (X(:,i)-1).^2;
    S_Pen = 1000000000.0*X_Pen(:,(i+1));
    S = S + S_i + S_Pen;
end;

function S = trigo(N,n,X)   % trigonometric function
S=ones(N,1);
for i=1:n,
    S_i = 8*sin(7*(X(:,i)- 0.9).^2).^2 + ...
        + 6*sin(14*(X(:,i)- 0.9).^2).^2 + (X(:,i) - 0.9).^2;
    S=S + S_i;
end;
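
The effect of the modified smoothing used for the variances in the script above is easiest to see by tabulating the weight `B_Mod` over the iterations. The short sketch below does only that, with the same `Betta` and `q` as in the script; it is an illustration added here, not part of the original program.

```matlab
% Sketch: the modified smoothing weight B_Mod = Betta - Betta*(1-1/t)^q.
Betta = 0.9; q = 6;
t = 1:2000;
B_Mod = Betta - Betta * (1 - 1 ./ t).^q;
% B_Mod multiplies the newly estimated variance; 1 - B_Mod multiplies the old one.
fprintf('t = %4d: B_Mod = %.4f\n', [t([1 2 10 100 2000]); B_Mod([1 2 10 100 2000])]);
```

The weight starts at `Betta` and decreases steadily, so the variances are updated ever more cautiously, which slows down the shrinking of the sampling distribution and reduces the risk of premature convergence.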



A.6 Beta Updating

The following program computes the value and derivative of the logarithm of the
gamma function at x. Its use, in conjunction with the second program in this section,
is illustrated in Exercise 5.4.

function [f,df]=slava(x);
if (x <= 0)
    f = inf;
    df = [];
    return;   % log-gamma is not evaluated for nonpositive x
end;
dx = 1.e-10;
f = gammaln(x);
if (x<1)
    y = x+1;
    df = (gammaln(y+dx)-gammaln(y))/dx-1/x;
elseif (x<2)
    df = (gammaln(x+dx)-gammaln(x))/dx;
elseif (x<200)
    df = 0;
    y = x;
    while (1)
        y = y-1;          % recurrence: derivative at y+1 = derivative at y + 1/y
        df = df+1/y;
        if (y<2)
            df = df+(gammaln(y+dx)-gammaln(y))/dx;
            break;
        end;
    end;
else
    df = log(x-1)+0.5/(x-1);   % asymptotic approximation for large x
end;

The function below calculates the solution to (5.11) and (5.12), by implementing
the ellipsoid method. The input parameters pp and qq are equal to minus the quotient
of the sums in (5.11) and (5.12), respectively.

function [alpha,beta] = SlavaEll(pp,qq);
n = 2;
e = zeros(n,1);
p = zeros(n,1);
q = zeros(n,1);
blx = [1.e-6;1.e-6];     % lower bounds
bux = [1.e12;1.e12];     % upper bounds
x = 0.5*(blx+bux);
rad = 0.5*norm(blx-bux);
B = rad*eye(n);
rep.cc = -1;
frec = inf;              % record (best) function value so far
fmod = -inf;
df = zeros(2,1);
contr.print = 1;
contr.eps = 1.e-12;
for itr = 1:1000,
    flag_c = 1;
    aa = sum((blx>x).*(blx-x));
    if aa,
        e = -(x<blx);
        flag_c = 0;
    end;
    if flag_c,
        aa = sum((x>bux).*(x-bux));
        if aa,
            e = (x>bux);
            flag_c = 0;
        end;
    end;
    if flag_c,
        al = x(1);
        bt = x(2);
        [fsum,dfsum] = slava(al+bt);
        [fal,dfal] = slava(al);
        [fbt,dfbt] = slava(bt);
        df(1) = -dfsum + dfal + pp;
        df(2) = -dfsum + dfbt + qq;
        f = -fsum + fal + fbt + pp*x(1) + qq*x(2);
        if f<frec,
            alpha = al;
            beta = bt;
            frec = f;
            if contr.print,
                % disp(sprintf('itr# %3d rec = %.6e',itr,frec));
            end;
        end;
        aa = f-frec;
        e = df;
    end;
    p = B'*e;
    pn = norm(p);
    if pn == 0,
        break;
    end;
    if flag_c,
        fmod = max(fmod,f-pn);
    end;
    if frec-fmod<contr.eps*max(1,abs(frec)),
        break;
    end;
    p = p/pn;
    aln = aa/pn;
    if (aln>1)&(frec==inf),
        break;
    end;
    if aln>=1,
        aln = 0.5;
    end;
    cfo = sqrt((1-aln^2)*n^2/(n^2-1));
    cfi = cfo*(1-sqrt((1-aln)*(n-1)/((1+aln)*(n+1))));
    shift = (1+n*aln)/(1+n);
    q = B*p;
    x = x - shift*q;
    B = cfo*B - cfi*q*p';
end;
[gsum,dgsum] = slava(alpha+beta);
[gal,dgal] = slava(alpha);
[gbt,dgbt] = slava(beta);
res1 = (dgsum-dgal-pp)/max(1,abs(pp));
res2 = (dgsum-dgbt-qq)/max(1,abs(qq));
%disp(sprintf('***Residuals: %.3e %.3e',res1,res2));
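
The objective minimized by the ellipsoid routine, and the residuals computed at its end, can be checked independently with the built-in functions `gammaln` and `psi` (the digamma function). In the sketch below the point (a, b) is chosen first and pp, qq are then constructed so that (a, b) is stationary; this construction and all names are assumptions made here for testing purposes only.

```matlab
% Sketch: the gradient of f(al,bt) = -lnG(al+bt) + lnG(al) + lnG(bt) + pp*al + qq*bt
% vanishes when psi(al+bt) - psi(al) = pp and psi(al+bt) - psi(bt) = qq.
a = 2.5; b = 4.0;                       % hypothetical solution
pp = psi(a + b) - psi(a);               % chosen so that (a,b) is stationary
qq = psi(a + b) - psi(b);
f = @(al, bt) -gammaln(al + bt) + gammaln(al) + gammaln(bt) + pp*al + qq*bt;
h = 1e-5;                               % step for central differences
g1 = (f(a + h, b) - f(a - h, b)) / (2*h);
g2 = (f(a, b + h) - f(a, b - h)) / (2*h);
fprintf('numerical gradient at (a,b): %g %g\n', g1, g2);
```

Since f is convex in (al, bt), a vanishing gradient identifies the minimizer, so a call such as SlavaEll(pp,qq) with these inputs would be expected to return values close to a and b.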



A.7 Banana Data

Banana shaped data sets occur frequently in clustering and classification test
problems. Below is a simple Matlab function that generates such data.

function b = banana(n,sigma,radius,theta1,theta2)
% Generate a Banana-shaped data set
% n      - number of data points (default: 200)
% sigma  - move points according to a normal dist. with this
%          standard deviation in both x and y directions
%          (default: 1)
% radius - the radius of the circle (of which the "banana" is
%          an arc) (default: 5)
% theta1 - starting angle of the arc (default: 9*pi/8)
% theta2 - ending angle of the arc (default: 19*pi/8)
if nargin<5,theta2=5*pi/2 - pi/8; end
if nargin<4,theta1=pi + pi/8; end
if nargin<3,radius=5;end
if nargin<2,sigma=1;end
if nargin<1,n=200;end

randn('seed', 123456789);   % optional
rand('seed', 987654321);    % optional

% angles uniformly distributed between theta1 and theta2
angles=theta1 + rand(n,1)*(theta2-theta1);
% transform these to random points on an arc of a circle
b=radius.*[cos(angles),sin(angles)];
% shift these points off the arc, independently and
% normally in the x and y directions
b=b+sigma*randn(n,2);




References 



1. E. H. L. Aarts and J. H. M. Korst. Simulated Annealing and Boltzmann Ma- 
chines. John Wiley & Sons, Chichester, 1989. 

2. E. H. L. Aarts and J. K. Lenstra. Local Search in Combinatorial Optimization. 
John Wiley & Sons, Chichester, 1997. 

3. T. S. Abdul-Razaq, C. N. Potts, and L. N. Van Wassenhove. A survey of
algorithms for the single machine total weighted tardiness scheduling problem.
Discrete Applied Mathematics, 26:235-253, 1990.

4. W. A. T. W. Abdullah. Seeking global minima. Journal of Computational
Physics, 110:320, 1994.

5. I. Adan and J. van der Wal. Monotonicity of the throughput in single server
production and assembly networks with respect to the buffer sizes. In H. G.
Perros and T. Altiok, editors, Queueing Networks with Blocking, pages 345-
356. Elsevier Science, 1989.

6. R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network Flows: Theory,
Algorithms, and Applications. Prentice Hall, Englewood Cliffs, 1993.

7. V. M. Aleksandrov, V. I. Sysoyev, and V. V. Shemeneva. Stochastic
optimization. Engineering Cybernetics, 5:11-16, 1968.

8. S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one 
distribution from another. Journal of the Royal Statistical Society, Series B, 
28:131-142, 1966. 

9. G. Alon, D. P. Kroese, T. Raviv, and R. Y. Rubinstein. Application of the 
cross-entropy method to the buffer allocation problem in a simulation-based 
environment. Annals of Operations Research, Kluwer Academic, 2004. 

10. S. Andradottir. A method for discrete stochastic optimization. Management 
Science, 41:1946-1961, 1995. 

11. S. Andradottir. A global search method for discrete stochastic optimization. 
SIAM J. Optimization, 6:513-530, 1996. 

12. S. Asmussen. Applied Probability and Queues. John Wiley & Sons, 1987.

13. S. Asmussen and K. Binswanger. Simulation of ruin probabilities for
subexponential claims. ASTIN Bulletin, 27(2):297-318, 1997.

14. S. Asmussen, K. Binswanger, and B. Højgaard. Rare events simulations for
heavy-tailed distributions. Bernoulli, 6:303-322, 2000.

15. S. Asmussen, D. P. Kroese, and R. Y. Rubinstein. Heavy tails, importance 
sampling and cross-entropy. Submitted, 2003. 







16. S. Asmussen and R. Y. Rubinstein. The efficiency and heavy traffic
properties of the score function method in sensitivity analysis of queueing models.
Advances in Applied Probability, 24(1):172-201, 1992.

17. S. Asmussen and R. Y. Rubinstein. Response surface estimation and sensitivity 
analysis via the efficient change of measure. Stochastic Models, 9:313-339, 1993. 

18. S. Asmussen and R. Y. Rubinstein. Complexity properties of steady-state 
rare-events simulation in queueing models. In Advances in Queueing: Theory, 
Methods and Open Problems, pages 429-462. CRC Press, 1995. 

19. A. Barto, R. Sutton, and C. Anderson. Neuron-like adaptive elements that 
can solve difficult learning control problems. IEEE Transactions on Systems, 
Man, and Cybernetics, 13:834-846, 1983. 

20. J. Baxter, P. L. Bartlett, and L. Weaver. Experiments with infinite-horizon, 
policy-gradient estimation. Journal of Artificial Intelligence Research, 15:351- 
381, 2001. 

21. R. E. Bellman. On a routing problem. Quarterly of Applied Mathematics, 
16(87-90), 1958. 

22. A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization.
SIAM, Philadelphia, 2001, pp. 342-352.

23. J. M. Bernardo and A. F. M. Smith. Bayesian Theory. John Wiley & Sons,
New York, 1994.

24. D. P. Bertsekas and R. Gallager. Data Networks. Prentice Hall, Englewood 
Cliffs, 1992. 

25. D.P. Bertsekas. Dynamic Programming and Optimal Control, volume I. Athena 
Scientific, 1995. 

26. D. J. Bertsimas. A vehicle routing problem with stochastic demand. Operations 
Research, 40:574-585, 1992. 

27. M. Den Besten, T. Stützle, and M. Dorigo. Ant colony optimization for the 
total weighted tardiness problem. Parallel Problem Solving from Nature, pages 
611-620, 2000. 

28. J. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. 
Plenum Press, New York, 1981. 

29. B. Bollobás. Graph Theory. Springer-Verlag, Berlin, 1979. 

30. V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochas- 
tic approximation and reinforcement learning. SIAM J. Control Optim., 
38(2):447-469, 2000. 

31. P. Boyle, O. Broadie, and P. Glasserman. Simulation methods for security 
pricing. J. Economic Dynamics and Control, 21:1267-1321, 1997. 

32. J. A. Bucklew. Large Deviation Techniques in Decision, Simulation and Estimation. 
John Wiley & Sons, New York, 1990. 

33. J. A. Buzacott and J. G. Shanthikumar. Stochastic Models of Manufacturing 
Systems. Prentice Hall, Englewood Cliffs, 1993. 

34. P. D. Carr, E. Cheah, P. M. Suffolk, S. G. Vasudevan, N. E. Dixon, and D. L. 
Ollis. X-ray structure of the signal transduction protein from Escherichia Coli 
at 1.9 Å. Acta Crystallogr. D, 52:93-104, 1996. 

35. M. Charikar, S. Khuller, and B. Raghavachari. Algorithms for capacitated vehicle 
routing. SIAM Journal on Computing, 31:665-682, 2001. 

36. K. Chepuri and T. Homem de Mello. Solving the vehicle routing problem with 
stochastic demands using the cross entropy method. Annals of Operations 
Research, Kluwer Academic, 2004. 







37. I. Cohen, B. Golany, and A. Shtub. Managing stochastic finite capacity multi- 
project systems through the cross-entropy method. Annals of Operations Re- 
search, Kluwer Academic, 2004. 

38. H. Cohn and M. Fielding. Simulated annealing searching for an optimal tem- 
perature schedule. SIAM Journal of Optimization, 9(3):779-802, 1999. 

39. T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley 
& Sons, New York, 1991. 

40. H. A. J. Crauwels, C. N. Potts, and L. N. Van Wassenhove. Local search 
heuristics for the single machine total weighted tardiness scheduling problem. 
INFORMS Journal of Computing, 10(3):341-350, 1998. 

41. Y. Dallery, Z. Liu, and D. Towsley. Equivalence, reversibility, symmetry and 
concavity properties in fork/join queueing networks with blocking. Journal of 
the ACM, 41(5):903-942, 1994. 

42. G. B. Dantzig and R. H. Ramser. The truck dispatching problem. Management 
Science, 6:80-91, 1959. 

43. P. T. de Boer. Analysis and efficient simulation of queueing models of telecommunication 
systems. PhD thesis, University of Twente, 2000. 

44. P. T. de Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein. A tutorial on 
the cross-entropy method. Annals of Operations Research, Kluwer Academic, 
2004. 

45. P. T. de Boer, D. P. Kroese, and R. Y. Rubinstein. Estimating buffer overflows 
in three stages using cross-entropy. In Proceedings of the 2002 Winter 
Simulation Conference, San Diego, pages 301-309, 2002. 

46. P. T. de Boer, D. P. Kroese, and R. Y. Rubinstein. A fast cross-entropy 
method for estimating buffer overflows in queueing networks. Management 
Science, 2004. 

47. L. Devroye. Non-Uniform Random Variate Generation. Springer-Verlag, New 
York, 1986. 

48. E. W. Dijkstra. A note on two problems in connection with graphs. Numerische 
Math., 1:269-271, 1959. 

49. M. Dorigo, V. Maniezzo, and A. Colorni. The ant system: optimization by 
a colony of cooperating agents. IEEE Transactions on Systems, Man, and 
Cybernetics — Part B, 26(1):29-41, 1996. 

50. U. Dubin. The cross-entropy method for combinatorial optimization with applications. 
Master’s thesis, The Technion, Israel Institute of Technology, Haifa, 
June 2002. 

51. U. Dubin. Application of the cross-entropy method for image segmentation. 
Annals of Operations Research, 2004. Submitted. 

52. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley 
& Sons, New York, 2001. 

53. T. Elperin, I. B. Gertsbakh, and M. Lomonosov. Estimation of network reliability 
using graph evolution models. IEEE Transactions on Reliability, 40(5):572-581, 
Dec 1991. 

54. P. Embrechts and N. Veraverbeke. Estimates for the probability of ruin with 
special emphasis on the possibility of large claims. Insurance Mathematics and 
Economics, 1:55-72, 1982. 

55. W. Feller. An Introduction to Probability Theory and Its Applications, volume I. 
John Wiley & Sons, 2nd edition, 1970. 

56. L. R. Ford. Network flow theory. Technical Report P-923, RAND Corporation, 
Santa Monica, California, 1956. 







57. M. R. Garey and D. S. Johnson. Computers and Intractability: A guide to 
the theory of NP-completeness. W. H. Freeman and Company, San Francisco, 
1979. 

58. M. J. J. Garvels and D. P. Kroese. A comparison of RESTART implementations. 
In Proceedings of the 1998 Winter Simulation Conference, pages 601-609, 
Washington, DC, 1998. 

59. M. J. J. Garvels, D. P. Kroese, and J. C. W. van Ommeren. On the importance 
function in RESTART simulation. European Transactions on Telecommunications, 
13(4), 2002. 

60. S. B. Gelfand and S. K. Mitter. Simulated annealing with noisy or imprecise 
energy measurements. JOTA, 62:49-62, 1989. 

61. S. B. Gershwin and J. E. Schor. Efficient algorithms for buffer space allocation. 
Annals of Operations Research, 93:117-144, 2000. 

62. I. B. Gertsbakh. Statistical Reliability Theory. Marcel Dekker, Inc., New York, 
1989. 

63. P. Glasserman, P. Heidelberger, P. Shahabuddin, and T. Zajic. A look at 
multilevel splitting. In H. Niederreiter, editor, Monte Carlo and Quasi Monte 
Carlo Methods 1996, Lecture Notes in Statistics, volume 127, pages 99-108. 
Springer-Verlag, New York, 1996. 

64. P. Glasserman, P. Heidelberger, P. Shahabuddin, and T. Zajic. A large deviations 
perspective on the efficiency of multilevel splitting. IEEE Transactions 
on Automatic Control, 43(12):1666-1679, 1998. 

65. P. Glasserman, P. Heidelberger, P. Shahabuddin, and T. Zajic. Multilevel splitting 
for estimating rare event probabilities. Operations Research, 47(4):585-600, 
1999. 

66. F. Glover and M. Laguna. Modern Heuristic Techniques for Combinatorial Optimization, 
chapter 3: Tabu search. Blackwell Scientific Publications, 
Oxford, 1993. 

67. P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. 
Communications of the ACM, 33(10):75-84, 1990. 

68. P. W. Glynn and D. L. Iglehart. Importance sampling for stochastic simula- 
tions. Management Science, 35:1367-1392, 1989. 

69. D. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learn- 
ing. Addison Wesley, 1989. 

70. W. B. Gong, Y. C. Ho, and W. Zhai. Stochastic comparison algorithm for dis- 
crete optimization with estimation. In Proceedings of the 31st IEEE Conference 
on Decision and Control, pages 795-800, 1992. 

71. C. Görg. Simulating rare event details of ATM delay time distributions 
with RESTART/LRE. In Proceedings of the RESIM Workshop. University 
of Twente, The Netherlands, March 1999. 

72. C. Görg and O. Fuß. Comparison and optimization of RESTART run time 
strategies. AEU, 52(3):197-204, 1998. 

73. G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes. Ox- 
ford University Press, Oxford, 3rd edition, 2001. 

74. D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University 
Press, Cambridge, 1997. 

75. W. J. Gutjahr. A generalized convergence result for the graph-based ant system 
meta-heuristic. Technical Report 91-016, Dept. of Statistics and Decision 
Support Systems, University of Vienna, Austria, 2000. 







76. W. J. Gutjahr. A graph-based ant system and its convergence. Future Generations 
Computing, 16:873-888, 2000. 

77. W. J. Gutjahr. Ant algorithms with the guaranteed convergence to the optimal 
solution. Technical report, Dept. of Statistics and Decision Support Systems, 
University of Vienna, Austria, 2001. 

78. W. J. Gutjahr and G. C. Pflug. Simulated annealing for noisy cost functions. 
Journal of Global Optimisation, 8:1-13, 1996. 

79. Z. Haraszti and J. Townsend. Rare event simulation of delay in packet switching 
networks using DPR-based splitting. In Proceedings of the RESIM Workshop, 
11-12 March 1999, pages 185-190. University of Twente, the Netherlands, 1999. 

80. W. K. Hastings. Monte Carlo sampling methods using Markov chains and their 
applications. Biometrika, 57:92-109, 1970. 

81. C. Heavey, H. Y. Papadopoulos, and J. Browne. The throughput rate of mul- 
tistation unreliable production lines. European Journal of Operation Research, 
68:69-89, 1993. 

82. P. Heidelberger. Fast simulation of rare events in queueing and reliability 
models. ACM Transactions on Modeling and Computer Simulation, 5:43-85, 
1995. 

83. B. E. Helvik and O. Wittner. Using the cross-entropy method to guide/govern 
mobile agent’s path finding in networks. In 3rd International Workshop on 
Mobile Agents for Telecommunication Applications - MATA’01, 2001. 

84. D. S. Hirschberg. Algorithms for the longest common subsequence problem. 
J. ACM, 24:664-675, 1977. 

85. T. Homem-de-Mello and R. Y. Rubinstein. Rare event probability estimation 
for static models via cross-entropy and importance sampling. Submitted. 

86. R. Horst, P. M. Pardalos, and N. V. Thoai. Introduction to Global Optimization. 
Kluwer Academic, 1996. 

87. K-P. Hui, N. Bean, M. Kraetzl, and D. P. Kroese. The cross-entropy method 
for network reliability estimation. Annals of Operations Research, Kluwer Aca- 
demic, 2004. 

88. M. R. Jerrum and A. J. Sinclair. Approximating the permanent. SIAM Journal 
on Computing, 18:1149-1178, 1989. 

89. M. R. Jerrum and A. J. Sinclair. Polynomial-time approximation algorithms 
for the Ising model. SIAM Journal on Computing, 22:1087-1116, 1993. 

90. N. L. Johnson and S. Kotz. Urn Models and Their Applications. John Wiley 
& Sons, New York, 1977. 

91. S. Juneja and P. Shahabuddin. Simulating heavy tailed processes using de- 
layed hazard rate twisting. ACM Transactions on Modeling and Computer 
Simulation, 12:94-118, 2002. 

92. L. P. Kaelbling, M. Littman, and A. W. Moore. Reinforcement learning - a 
survey. Journal of Artificial Intelligence Research, 4:237-285, May 1996. 

93. H. Kahn and T. E. Harris. Estimation of Particle Transmission by Random 
Sampling. National Bureau of Standards Applied Mathematics Series, 1951. 

94. Y. M. Kaniovski, A. J. King, and R. J.-B. Wets. Probabilistic bounds (via large 
deviations) for the solutions of stochastic programming problems. Annals of 
Operations Research, 56:189-208, 1995. 

95. J. N. Kapur and H. K. Kesavan. Entropy Optimization Principles with Appli- 
cations. Academic Press, New York, 1992. 







96. M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial 
time. In Proc. of the 15th Int. Conf. on Machine Learning, pages 260-268, San 
Francisco, 1998. Morgan Kaufmann. 

97. J. D. Kececioglu, H.-P. Lenhof, K. Mehlhorn, P. Mutzel, K. Reinert, and 
M. Vingron. A polyhedral approach to sequence alignment problems. Disc. 
Appl. Math., 104:143-186, 2000. 

98. J. Keith and D. P. Kroese. Sequence alignment by rare event simulation. In 
Proceedings of the 2002 Winter Simulation Conference, pages 320-327, San 
Diego, 2002. 

99. A. I. Khinchin. Information Theory. Dover Publications, Inc., New York, 1957. 

100. S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated 
annealing. Science, 220:671-680, 1983. 

101. I. Kovalenko. Approximations of queues via small parameter method. In 
J. H. Dshalalow, editor, Advances in Queueing: Theory, Methods and Open 
Problems, pages 481-509. CRC Press, New York, 1995. 

102. V. Kriman and R. Y. Rubinstein. Polynomial time algorithms for estimation of 
rare events in queueing models. In J. Dshalalow, editor, Frontiers in Queueing: 
Models and Applications in Science and Engineering, pages 421-448, New York, 
1995. CRC Press. 

103. D. P. Kroese and R. Y. Rubinstein. The transform likelihood ratio method for 
rare event simulation with heavy tails. Queueing Systems, 2004. 

104. A. Krogh, M. Brown, I. S. Mian, K. Sjölander, and D. Haussler. Hidden Markov 
models in computational biology: applications to protein modeling. J. Mol. 
Biol., 235:1501-1531, 1994. 

105. C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and 
J. C. Wootton. Detecting subtle sequence signals: A Gibbs sampling strategy 
for multiple alignment. Science, 262:208-214, 1993. 

106. P. L’Ecuyer. A unified view of the IPA, SF, and LR gradient estimation tech- 
niques. Management Science, 36:1364-1384, 1990. 

107. E. L. Lehmann. Testing Statistical Hypotheses. Springer-Verlag, New York, 
1997. 

108. D. Lieber. Rare-events estimation via cross-entropy and importance sampling. 
PhD thesis, William Davidson Faculty of Industrial Engineering and Manage- 
ment, Technion, Haifa, Israel, 1998. 

109. D. Lieber, R. Y. Rubinstein, and D. Elmakis. Quick estimation of rare events 
in stochastic networks. IEEE Transactions on Reliability, 46:254-265, 1997. 

110. Z. Liu, A. Doucet, and S. S. Singh. The cross-entropy method for blind mul- 
tiuser detection. In IEEE International Symposium on Information Theory, 
Chicago, 2004. 

111. L. Lovász. Randomized algorithms in combinatorial optimization. DIMACS 
Series in Discrete Mathematics and Theoretical Computer Science, 25:153-179, 
1995. 

112. S. Mannor, R. Y. Rubinstein, and Y. Gat. The cross-entropy method for fast 
policy search. In The 20th International Conference on Machine Learning 
(ICML-2003), Washington, DC, August 2003. 

113. L. Margolin. Cross-entropy method for combinatorial optimization. Master’s 
thesis, The Technion, Israel Institute of Technology, Haifa, July 2002. 

114. L. Margolin. On the convergence of the cross-entropy method. Annals of 
Operations Research, Kluwer Academic, 2004. 







115. G. McLachlan and T. Krishnan. The EM Algorithm and Extensions. John 
Wiley & Sons, New York, 1997. 

116. L. E. Meester and J. G. Shanthikumar. Concavity of the throughput of tan- 
dem queueing systems with finite buffer storage space. Advances in Applied 
Probability, 22:764-767, 1990. 

117. I. Menache, S. Mannor, and N. Shimkin. Basis Function Adaption in Temporal 
Difference Reinforcement Learning. Annals of Operations Research, Kluwer 
Academic, 2004. 

118. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and 
E. Teller. Equations of state calculations by fast computing machines. J. 
of Chemical Physics, 21:1087-1092, 1953. 

119. G. A. Mikhailov. Calculation of system parameter derivatives of functionals 
of the solutions to the transport equations. Journal of Computational Mathe- 
matics and Mathematical Physics, 1976. 

120. L. B. Miller. Monte carlo analysis of reactivity coefficients in fast reactors: gen- 
eral theory and applications. Technical report, Argonne National Laboratory, 
IL, 1967. 

121. C. N. Morris. Natural exponential families with quadratic variance functions. 
The Annals of Statistics, 10:65-80, 1982. 

122. S. B. Needleman and C. D. Wunsch. A general method applicable to the 
search for similarities in the amino acid sequence of two proteins. J. Mol. 
Biol., 48:443-453, 1970. 

123. W. I. Norkin, G. C. Pflug, and A. Ruszczynski. A branch-and-bound method 
for stochastic global optimization. Working paper. International Institute for 
Applied System Analysis, Laxenburg, Austria, 1996. 

124. I. H. Osman and G. Laporte. Metaheuristics: a bibliography. Annals of Oper- 
ations Research, 63:513-523, 1996. 

125. C. H. Papadimitriou and M. Yannakakis. Optimization, approximation, and 
complexity classes. J. Comput. System Sci., 43:425-440, 1991. 

126. H. T. Papadopoulos and G. A. Vouros. A model management system (MMS) 
for the design and operation of production lines. International Journal of 
Production Research, 35(8):2213-2236, 1997. 

127. R. G. Parker and R. L. Rardin. Discrete Optimization. Academic Press, San 
Diego, 1996. 

128. P. A. Pevzner. Multiple alignment, communication cost and graph matching. 
SIAM J. Appl. Math., 52:1763-1779, 1992. 

129. J. D. Pinter. Global Optimization in Action. Kluwer Academic, Dordrecht, 
1996. 

130. G. Potamianos and J. Goutsias. Stochastic approximation algorithms for partition 
function estimation of Gibbs random fields. IEEE Transactions on Information 
Theory, 43(6):1948-1965, 1997. 

131. C. N. Potts and L. N. van Wassenhove. Single machine tardiness sequencing 
heuristics. IIE Transactions, 23:346-354, 1991. 

132. M. Puterman. Markov Decision Processes. Wiley-Interscience, 1994. 

133. T. K. Ralphs, L. Kopman, W. R. Pulleyblank, and L. E. Trotter. On the 
capacitated vehicle routing problem. Mathematical Programming, (B) 94, 2003. 

134. V. J. Rayward-Smith, I. H. Osman, C. R. Reeves, and G. D. Smith. Modern 
Heuristic Search Methods. John Wiley & Sons, Chichester, 1996. 







135. C. R. Reeves. Modern heuristic techniques. In V. J. Rayward-Smith, I. H. 
Osman, C. R. Reeves, and G. D. Smith, editors, Modern Heuristic Search 
Methods, Chichester, 1996. John Wiley & Sons. 

136. M. I. Reiman and A. Weiss. Sensitivity analysis for simulations via likelihood 
ratios. Operations Research, 37(5):830-844, 1989. 

137. A. Ridder. Importance sampling simulations of Markovian reliability systems 
using cross-entropy. Annals of Operations Research, Kluwer Academic, 2004. 

138. H. E. Romeijn and R. L. Smith. Simulated annealing for constrained global 
optimization. Journal of Global Optimization, 5:101-126, 1994. 

139. M. T. Rosenstein and A. G. Barto. Robot weightlifting by direct policy search. 
In Bernhard Nebel, editor, Proceedings of the Seventeenth International Joint 
Conference on Artificial Intelligence, pages 839-846, Seattle, 2001. Morgan 
Kaufmann. 

140. R. Y. Rubinstein. Some problems in Monte Carlo Optimization. PhD thesis, 
University of Riga, Latvia, 1969. In Russian. 

141. R. Y. Rubinstein. A Monte Carlo method for estimating the gradient in a 
stochastic network. Manuscript, Technion, Israel, 1976. 

142. R. Y. Rubinstein. Simulation and the Monte Carlo Method. John Wiley & 
Sons, New York, 1981. 

143. R. Y. Rubinstein. Monte Carlo Optimization Simulation and Sensitivity of 
Queueing Network. John Wiley & Sons, New York, 1986. 

144. R. Y. Rubinstein. Optimization of computer simulation models with rare 
events. European Journal of Operational Research, 99:89-112, 1997. 

145. R. Y. Rubinstein. The cross-entropy method for combinatorial and continuous 
optimization. Methodology and Computing in Applied Probability, 2:127-190, 
1999. 

146. R. Y. Rubinstein. Combinatorial optimization, cross-entropy, ants and rare 
events. In S. Uryasev and P. M. Pardalos, editors, Stochastic Optimization: 
Algorithms and Applications, pages 304-358, Dordrecht, 2001. Kluwer. 

147. R. Y. Rubinstein. The cross-entropy method and rare-events for maximal 
cut and bipartition problems. ACM Transactions on Modelling and Computer 
Simulation, 12(1):27-53, 2002. 

148. R. Y. Rubinstein and B. Melamed. Modern Simulation and Modeling. Wiley 
Series in Probability and Statistics, New York, 1998. 

149. R. Y. Rubinstein and A. Shapiro. Discrete Event Systems: Sensitivity Analysis 
and Stochastic Optimization via the Score Function Method. John Wiley & 
Sons, New York, 1993. 

150. J. S. Sadowsky. On the optimality and stability of exponential twisting in 
Monte Carlo simulation. IEEE Trans. Info. Theory, IT-39:119-128, 1993. 

151. F. Schreiber and C. Görg. A modified RESTART method using the LRE-algorithm. 
In North Holland, editor. Proceedings of the 14th International 
Teletraffic Congress, pages 787-796, 1994. 

152. F. Schreiber and C. Görg. The RESTART/LRE method for rare event simulation. 
In Proceedings of the 1996 Winter Simulation Conference, pages 390-397, 
1996. 

153. P. K. Sen and J. M. Singer. Large Sample Methods in Statistics. Chapman & 
Hall/CRC, New York, 1993. 

154. P. Shahabuddin. Rare event simulation of stochastic systems. In Proceedings 
of the 1995 Winter Simulation Conference, pages 178-185, Washington, D.C., 
1995. IEEE Press. 







155. J. G. Shanthikumar and D. D. Yao. Monotonicity and concavity properties 
in cyclic queueing networks with finite buffers. In H.G. Perros and T. Altiok, 
editors, Queueing Networks with Blocking, pages 325-344. Elsevier Science, 
1989. 

156. A. Shapiro. Simulation based optimization: convergence analysis and statistical 
inference. Stochastic Models, 12:425-454, 1996. 

157. J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans- 
actions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000. 

158. L. Shi and S. Olafsson. Nested partitioning method for global optimization. 
Operations Research, 48(3):390-407, 2000. 

159. L. Shi, S. Olafsson, and N. Sun. New parallel randomized algorithm for travel- 
ing salesman problem. Computers and Operations Research, 26:371-394, 1999. 

160. D. Siegmund. Importance sampling in the Monte Carlo study of sequential 
tests. Annals of Statistics, 4:673-684, 1976. 

161. T. F. Smith and M. S. Waterman. Identification of common molecular subse- 
quences. J. Mol. Biol., 147:195-197, 1981. 

162. D. G. Stork and E. Yom-Tov. Computer Manual to Accompany Pattern Clas- 
sification. John Wiley & Sons, 2004. 

163. R. Sutton and A. Barto. Reinforcement learning: An introduction. Technical 
report, 1998. 

164. F. Szidarovszky and S. Yakowitz. Principles and Procedures of Numerical 
Analysis. Plenum Press, New York, 1978. 

165. J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Ma- 
chine Learning, 16:185-202, 1994. 

166. M. Villen-Altamirano and J. Villen-Altamirano. RESTART: A method for 
accelerating rare event simulations. In J. W. Cohen and C. D. Pack, editors, 
Proceedings of the 13th International Teletraffic Congress, Queueing, Perfor- 
mance and Control in ATM, pages 71-76, 1991. 

167. M. Villen-Altamirano and J. Villen-Altamirano. About the efficiency of 
RESTART. In Proceedings of the RESIM’99 Workshop, pages 99-128. Uni- 
versity of Twente, the Netherlands, 1999. 

168. G. A. Vouros and H. T. Papadopoulos. Buffer allocation in unreliable produc- 
tion lines using a knowledge based system. Computer & Operations Research, 
25(12):1055-1067, 1998. 

169. I. A. Wagner, M. Lindenbaum, and A. M. Bruckstein. ANTS: Agents on Networks, 
Trees and Subgraphs. Future Generation Computer Systems, 16(8):915-926, 
2000. 

170. A. J. Walker. An efficient method for generating discrete random variables with 
general distributions. Assoc. Comput. Mach. Trans. Math. Software, 3:253-256, 
1977. 

171. C. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge Univer- 
sity, 1989. 

172. A. Webb. Statistical Pattern Recognition. Arnold, London, 1999. 

173. R. Wheeler and K. Narendra. Decentralized learning in finite Markov chains. 
IEEE Trans. on Automatic Control, 31:519-526, 1996. 

174. D. White. Markov Decision Process. John Wiley & Sons, 1992. 

175. R. J. Williams. Simple statistical gradient-following algorithms for connection- 
ist reinforcement learning. Machine Learning, 8:229-256, 1992. 




Index 



n-queens problem, 58, 201, 277 

acceptance-rejection method, 22, 27, 
151, 201 

actual waiting time, 100 
adjacency matrix, 241 
Ali-Silver distance, 13 
alias method, 21, 151, 176 
alignment, 228 
graph, 229 
path, 229 
vector, 229 

Andradottir’s method, 129 
ant colony optimization, 129 
associated stochastic problem, 41, 130, 
132 

asymptotic optimality, 10 
atomic density, 132 

banana data, 263 

base measure, 3 

batch mean, 212 

Bayesian inference, 270 

big-step CE method, 104, 123, 201 

bounded relative error, 9, 80, 95 

buffer allocation problem, 203, 207 

capacitated vehicle routing problem, 
238 

centroid, 260 

change of measure, 32, 37, 59, 63 
change of variable, 91 
clique, 241 
maximal, 241 



maximum, 156, 241 
number, 241 
problem, 240, 247 
clustering problem, 260, 268 
combinatorial optimization, 41 
complexity theory, 9 
composition method, 21, 177 
conditional sampling, 137, 140 
continuous multi-extremal optimization, 
187, 225 

counting measure, 3 
Cramer-Rao inequality, 15, 26 
cross-entropy, 1, 13, 60, 67 
algorithm 

(fully) adaptive, 88, 89, 194 
big-step, 104, 201 
clique, 243, 247 
estimation, 40, 69, 73, 74 
maxcut, 142 
noisy, 205 

optimization, 42, 134, 135 
root-finding, 90 
sequence alignment, 232 
single-phase, 44 

travelling salesman problem, 153 
crude Monte Carlo, 9, 10, 32, 59, 63 
cumulant function, 7 
cumulative distribution function, 3 

degenerate 
density, 132 
transition matrix, 140 
density, 3 
Dirac measure, 42 







discounted reward, 255 
distribution 

(un)bounded support, 5 
Bernoulli, 4 

beta, 4, 24, 28, 93, 122, 200 
binomial, 4 
degenerate, 138 
discrete uniform, 4 
double exponential, 4, 124 
exponential, 4 
exponential family, 6 
finite support, 133 
gamma, 4, 28 
generalized Beta, 157 
geometric, 4 
heavy-tail, 5, 24, 78, 96 
light-tail, 4 
log-normal, 5, 24 
multivariate normal, 201, 262 
normal, 4, 82 
Pareto, 4, 83 
Poisson, 4 
regularly varying, 5 
shifted exponential, 4, 82 
subexponential, 5, 119 
truncated exponential, 4 
truncated normal, 201 
two-parameter, 80 
uniform, 4 
Weibull, 4, 78, 80 
dominating density, 62 
dynamic programming, 255 
dynamic simulation models, 60 

edit distance, 229 
efficient score, 14 
elite samples, 135, 192 
ellipsoid method, 200, 284 
EM algorithm, 248, 249 
entropy, 11 
conditional, 12 
cross-, 13 
differential, 11 
joint, 11 
relative, 13 
Shannon, 11 
expectation, 3 

exponential change of measure, 60 
exponential complexity, 76, 128 



exponential family, 6, 14 
natural -, 7 
exponential twist, 7 
exponential-time estimator, 9 

finite support distribution, 70 
Fisher information, 15, 26 
forward and backward loop, 148 
fully adaptive cross-entropy, 88, 191, 
254 

gambler’s ruin, 126 
gamma function, 4 
genetic algorithm, 129 
GI/G/1 queue, 100, 111, 124 
graph 

complement, 242 
complete, 241 
evolution, 127 
independent set, 242, 247 
minimum vertex cover, 242, 247 
node coloring, 247 
order, 241 
size, 241 

vertex coloring, 247 

hazard rate twisting, 61, 83 
Hessian, 15 

image analysis, 268 
image segmentation, 269 
importance sampling, 32, 36, 62, 63, 67 
independent set, 242 
instability property, 112 
inverse-transform method, 19, 27 
inverse-transform likelihood ratio 
method, 92, 123 

inverse-transform method, 92, 151, 245 

K-means, 261 
fuzzy, 261 

Kullback-Leibler distance, 13, 67 

Lebesgue measure, 3 
level, 36, 72 

likelihood function, 14, 248 
likelihood ratio, 16, 36, 38, 63 
- estimator, 16, 32, 63 
Lindley equation, 100 
linear vector quantization, 263 







location theory, 155 

logarithmic efficiency, 10, 61, 118 

longest path problem, 140, 186, 203 

M/G/1 queue, 124 
machine scheduling, 234, 237 
Markov chain, 139, 231 
Markov chain with replacement, 150, 
206 

Markov decision process, 254, 268 
shortest path, 256 
Markov policy, 254 
mastermind, 251 

maximal cut problem, 46, 140, 157, 185, 
269 

- with r partitions, 145 
maximum entropy principle, 14 
maximum likelihood 
estimate, 14, 248 
estimator, 14, 44, 81 
method of moments estimator, 25 
moment generating function, 4, 7 
multiple solutions, 162, 170 
mutual information, 12 

natural exponential family, 7, 38, 69, 
126 

nested partitioning method, 129 
network reliability, 126 
neural computation, 251 
node placement, 151, 235 
node transition, 149 
noisy optimization, 203, 257 
nominal 

distribution, 60 
parameter, 33, 64 

optimal degenerate transition matrix, 
150, 186, 206, 217, 244 
optimization, 130 
noisy, 130, 196, 257 
order statistic, 39 

packing problem, 156 
parallel computing, 143 
parameter setting in CE, 135 
partition problem, 145, 158, 179, 211 
performance function, 62 
permutation flow shop problem, 247 



policy, 254 
policy iteration, 255 
Pollaczek-Khinchine formula, 124 
polynomial complexity, 79 
polynomial-time estimator, 9 
probability density function (pdf), 3 
probability distribution, 3 
probability mass function (pmf), 3 

quadratic assignment problem, 154 

random experiment, 2 

random sample, 32, 63 

random vector, 3 

random walk, 102, 125 

rare event, 9, 31, 36, 59, 72 

rarity parameter, 45 

reference parameter, 64, 65, 68 

regenerative method, 100 

reinforcement learning, 255 

relative error (RE), 8, 9 

relative experimental error, 54, 157, 213 

reliability, 126 

response surface, 17 

RESTART, 59 

reward function, 44, 190, 217 
root finding, 90 
Rosenbrock function, 189 

score function, 14, 68 
k-th order, 17 
- method, 16 
sensitivity analysis, 16 
sequence alignment, 227 
shortest path problem, 203 
simulated annealing, 129 
simulation 

conditional Monte Carlo, 61 
dynamic, 40, 100 
Monte Carlo, 32, 59 
rare-event, 31, 72 
RESTART, 59 
splitting, 59 
static, 40, 100 
transient, 104 

simulation-based optimization, 209 
single machine common due date 
problem, 237 

single machine total weighted tardiness 
problem, 234 







smoothed updating, 134 
smoothing scheme, 189, 268 
squared coefficient of variation, 8, 9 
stability number, 242 
standard likelihood ratio, 60, 64, 83, 123 
static simulation models, 60 
stationary policy, 254 
stochastic approximation, 256 
stochastic comparison method, 129 
stochastic counterpart method, 66 
stochastic edge network, 30, 130, 138, 
168 

stochastic node network, 30, 130, 136, 
156 

stochastic optimization, 66, 203 
stochastic process, 3 
stochastic shortest path, 62, 109 
stochastic TSP, 204 

stopping criterion, 35, 66, 144, 147, 153, 
206 

strategy, 254 

stratification, 245 

switching change of measure, 101 

tabu search, 129 



tandem queue, 116 
trajectory generation 
random cuts, 142 
random partitions, 147 
TSP, 149, 152, 174, 177 
transform likelihood ratio, 61, 91, 123 
transition matrix, 139 
travelling salesman problem, 51, 140, 
147, 156, 169, 175, 185, 216 
noisy — , 204 

trigonometric function, 189 
twisting parameter, 7 

unbiased estimator, 8 
unreliable estimate, 193 

value iteration, 255 
variance minimization, 13, 29, 60 
algorithm, 66 
vector quantization, 261 
vehicle routing problem, 238 
vertex cover, 242 

zero-variance estimator, 37, 59 



